Class Main

java.lang.Object
_global.tri.oxidationstates.Main

public class Main extends Object
This is the main class containing the starting points for various oxidation analyzer routines.
  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    static String
    CAMD_DIR
    This directory contains the data set from "Novel inorganic crystal structures predicted using autonomous simulation agents" (https://doi.org/10.1038/s41597-022-01438-8)
    static String
    GII_DIR
    This subdirectory contains structures with assigned oxidation states, as assigned by different methods.
    static String
    PARAMETER_DIR
    This is where we store the model parameters (oxidation state boundaries)
    static String
    ROOT_DIR
    All input and output files should be contained under this directory
    static String
    STRUCT_DIR
    This contains all of the structures from the broad ICSD data set
    static String
    TRAINING_DATA_DIR
    This is where we store the training data sets
  • Constructor Summary

    Constructors
    Constructor
    Description
    Main()
     
  • Method Summary

    Modifier and Type
    Method
    Description
    static void
    addWebIonsToData(_global.tri.oxidationstates.Main.DataType inDataType, _global.tri.oxidationstates.Main.DataType outDataType)
    The "web ions" are the polyatomic ions used in the manuscript and web site.
    static void
    assignBertosStates()
    Reads the states calculated by BERTOS based on composition and assigns them to sites in a way that minimizes the GII.
    static void
    calcGIIForPymatgenStructs()
    Reads the states calculated by PyMatGen and assigns them to sites in a way that minimizes the GII.
    static void
    cleanDataGII(_global.tri.oxidationstates.Main.DataType inType, _global.tri.oxidationstates.Main.DataType outType)
    Looks through all entries of the data set given by inType and determines whether the GII is lower than the GII for the ICSD (calculated by assigning ICSD states to sites in a way that minimizes the GII).
    static void
    compareAssignments(_global.tri.oxidationstates.Main.DataType dataType)
    This method prints out summary statistics comparing sets of oxidation state assignments (e.g. from different methods for predicting oxidation states) for a given data set.
    static void
    findIons(_global.tri.oxidationstates.Main.DataType dataType)
    Looks through the training set for the given data type to extract polyatomic ions, as determined by the OxideIonFinder, ZintlIonFinder, or CompositionIonFinder
    static void
    fitParameters(OxidationStateData data, int maxIterationsPerPass, double regParameter, String paramFileName, int numThreads)
    Fits the model parameters for all ions contained in the training data, randomly initializing the parameters to values between zero and 1
    static void
    fitParameters(OxidationStateData data, LikelihoodCalculator initCalculator, int maxIterationsPerPass, double regParameter, String paramFileName, int numThreads)
    Fits the model parameters for all ions contained in the training data, randomly initializing the parameters to values between zero and 1
    static void
    getCAMDAssignments(_global.tri.oxidationstates.Main.DataType dataType, boolean noPoly)
    Assign oxidation states to the data set used in the CAMD search for new materials.
    static void
    getCAMDDiscoveryCurve(_global.tri.oxidationstates.Main.DataType dataType)
    This prints the data showing the percent of compositions with stable structures found vs the percent of total compositions evaluated using the CAMD data sorted by Likelihood score.
    static void
    getCAMDHullHistogram(_global.tri.oxidationstates.Main.DataType dataType)
    Get the histogram of what percentage of structures are on the hull as a function of likelihood score for the CAMD data.
    static OxidationStateData
    getData(_global.tri.oxidationstates.Main.DataType dataType)
    Return the data set for the given data type.
    static String
    getDataFileName(_global.tri.oxidationstates.Main.DataType dataType)
    Return the file name for the given data set
    static Map<String,Double>
    getEnergiesAboveHull()
    Reads energies above the convex hull from a file extracted from the Materials Project and creates a Map (dictionary, for you Python folks) keyed by the icsd_id and with the energy above the hull as the value.
    static void
    getICSDAssignments(_global.tri.oxidationstates.Main.DataType dataType, boolean reassign)
    Assigns oxidation states to all of the structures in the given dataset using oxidation states provided in the ICSD.
    static void
    getInitialTrainingData()
    Starting with a directory of CIF files extracted from the ICSD, select all charge-balanced structures and use them to construct the initial training data file.
    static String
    getIonStructureDirName(_global.tri.oxidationstates.Main.DataType dataType)
    Return the polyatomic ion structure directory name for the given data set
    static LikelihoodCalculator
    getLikelihoodCalculator(_global.tri.oxidationstates.Main.DataType dataType)
    Return a likelihood calculator for the given data type.
    static String
    getParamFileName(_global.tri.oxidationstates.Main.DataType dataType)
    Returns the name of the fitted parameter file for the given data type.
    static String
    getPolyatomicIonDirName(_global.tri.oxidationstates.Main.DataType dataType)
    Return the name of directory of found polyatomic ions for the given data set
    static void
    getTrainingAssigments(_global.tri.oxidationstates.Main.DataType dataType, boolean frequency)
    Assigns oxidation states to all of the structures in the given dataset using the model trained on that dataset and writes out corresponding CIF files to a directory.
    static void
    getValidationAssignments(_global.tri.oxidationstates.Main.DataType dataType, boolean frequency)
    Assigns oxidation states calculated using 10-fold cross validation to all of the structures in the given dataset and writes out corresponding CIF files to a directory.
    static String
    getWebIonDirName()
    Returns the directory name for the polyatomic ions used for the paper and web site
    static void
    groupIons(_global.tri.oxidationstates.Main.DataType dataType, int minOccurrences, int numSamples)
    Places ions of the same composition into groups of structurally similar ions, where the atomic oxidation states of all ions in the set need to match.
    static void
    groupIonsByOxidationState(double minAllowedFraction, boolean removeZeroes, int minAllowedOccurrences)
    Place the mean structures found by groupIons(_global.tri.oxidationstates.Main.DataType,int,int) in groups by total oxidation state, calculated by adding the oxidation states of all atoms in the ion.
    static void
    loadWebIons()
    Loads the polyatomic ions used for the web site and paper to the IonFactory.
    static void
    main(String[] args)
    This is the main entry point for the program.
    static void
    prepareDataFromICSD(_global.tri.oxidationstates.Main.DataType outDataType)
    Reads a directory of CIF files exported from the ICSD and builds an initial training data file for all ordered, charge-balanced structures.
    static void
    printAllKnownStates(_global.tri.oxidationstates.Main.DataType dataType)
    Write all of the distinct oxidation states in a given data set
    static void
    printElectrochemicalSeries(_global.tri.oxidationstates.Main.DataType dataType)
    Generate a table for the electrochemical series, consisting of boundaries for all redox pairs in the data set.
    static void
    removeRareIons(_global.tri.oxidationstates.Main.DataType inDataType, _global.tri.oxidationstates.Main.DataType outDataType)
    Removes all entries with rare ions from a data set, where an ion is considered "rare" if it appears in fewer than 25 entries.
    static void
    splitData(OxidationStateData data, int numSplits, String outDirName)
    Randomly splits the data for leave-k-out cross validation in a way that ensures that no composition appears in both the test and training set for any split.
    static OxidationStateSet
    stateSetFromBertosString(String bertosString)
    Reads an oxidation state prediction as output by BERTOS and converts it into an OxidationStateSet for use with this code.
    static void
    testWebAPI()
    This method provides an example of how to call the API to replicate the table-generating functionality of the web app.
    static void
    trainModel(String[] args)
    This method is designed for use in a command-line interface to fit the model.
    static void
    writeBoundaryJSON(_global.tri.oxidationstates.Main.DataType dataType)
    Write the JSON file containing the oxidation state boundaries.
    static void
    writeDataFilesForVisualization(_global.tri.oxidationstates.Main.DataType dataType, boolean smoothCutoff)
    Generate the files we use to generate the wavy bar plots (or equivalent straight line plots).
    static void
    writeXYZFiles()
    Write files in xyz format for each of the polyatomic ions used for the web site and paper

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Field Details

    • ROOT_DIR

      public static String ROOT_DIR
      All input and output files should be contained under this directory
    • STRUCT_DIR

      public static String STRUCT_DIR
      This contains all of the structures from the broad ICSD data set
    • TRAINING_DATA_DIR

      public static String TRAINING_DATA_DIR
      This is where we store the training data sets
    • PARAMETER_DIR

      public static String PARAMETER_DIR
      This is where we store the model parameters (oxidation state boundaries)
    • GII_DIR

      public static String GII_DIR
      This subdirectory contains structures with assigned oxidation states, as assigned by different methods. It also should contain files for oxidation state assignments output from pymatgen and BERTOS.
    • CAMD_DIR

      public static String CAMD_DIR
      This directory contains the data set from "Novel inorganic crystal structures predicted using autonomous simulation agents" (https://doi.org/10.1038/s41597-022-01438-8)
  • Constructor Details

    • Main

      public Main()
  • Method Details

    • main

      public static void main(String[] args)
      This is the main entry point for the program. Other routines are called from here.
      Parameters:
      args - Command line arguments
    • trainModel

      public static void trainModel(String[] args)
      This method is designed for use in a command-line interface to fit the model.
      Parameters:
      args - The arguments that can be passed to this method. They will be parsed as follows:
        args[0]: The name of the file containing the training data.
        args[1]: The name of the file to be written containing parameters of the trained model. The file name should start with "parameters". A corresponding file containing boundary values will also be written, where the name of the boundaries file will be the name of the parameters file with "parameters" replaced with "boundaries".
        args[2]: The name of the directory containing the atomic structures of allowed polyatomic ions.
        args[3] (optional): This many optimization steps will be run before updating the files "boundaries.txt" and "parameters.txt" and restarting the optimization from the written values. The default value is 1000.
        args[4] (optional): The parameter to be multiplied by the sum of the spreads in the minimum and maximum boundary values for regularization. The default value is 0.
        args[5] (optional): The number of threads to run simultaneously for the optimization. The default value is 1.
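The argument layout described for trainModel can be illustrated with a minimal sketch. This is not the actual implementation: the class `TrainModelArgsSketch` and its field names are hypothetical, and deriving the boundaries file name with a plain text substitution is an assumption. Only the argument order, the default values, and the "parameters" to "boundaries" naming rule are taken from the description.

```java
// Hypothetical holder for the documented trainModel arguments (not part of the API).
class TrainModelArgsSketch {
    final String trainingDataFile;    // args[0]
    final String paramFileName;       // args[1]
    final String boundaryFileName;    // derived from args[1]
    final String ionDirName;          // args[2]
    final int maxIterationsPerPass;   // args[3], documented default 1000
    final double regParameter;        // args[4], documented default 0
    final int numThreads;             // args[5], documented default 1

    TrainModelArgsSketch(String[] args) {
        if (args.length < 3) {
            throw new IllegalArgumentException(
                    "Expected at least 3 arguments, got " + args.length);
        }
        trainingDataFile = args[0];
        paramFileName = args[1];
        // Assumption: a literal text substitution derives the boundaries file name.
        boundaryFileName = paramFileName.replace("parameters", "boundaries");
        ionDirName = args[2];
        maxIterationsPerPass = args.length > 3 ? Integer.parseInt(args[3]) : 1000;
        regParameter = args.length > 4 ? Double.parseDouble(args[4]) : 0.0;
        numThreads = args.length > 5 ? Integer.parseInt(args[5]) : 1;
    }

    public static void main(String[] cliArgs) {
        // Made-up file and directory names, for illustration only.
        TrainModelArgsSketch parsed = new TrainModelArgsSketch(
                new String[] {"training_data.txt", "parameters_demo.txt", "ions", "500"});
        System.out.println(parsed.boundaryFileName + " " + parsed.maxIterationsPerPass
                + " " + parsed.regParameter + " " + parsed.numThreads);
    }
}
```

Collecting the values in one immutable object keeps the optional arguments and their defaults in a single place before anything is passed on to a fitting routine.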
    • fitParameters

      public static void fitParameters(OxidationStateData data, int maxIterationsPerPass, double regParameter, String paramFileName, int numThreads)
      Fits the model parameters for all ions contained in the training data, randomly initializing the parameters to values between zero and 1
      Parameters:
      data - The training data
      maxIterationsPerPass - After this many optimization steps have been performed, the code will write the current parameters to a file and restart optimization from those parameters.
      regParameter - The regularization parameter to determine how much weight to give the regularization term.
      paramFileName - The name of the output file containing the parameters. The file name should start with "parameters". A separate file containing the boundary values will also be written, where the name of the boundary file will be the same as the name of the parameters file, with the word "parameters" replaced by the word "boundaries".
      numThreads - The number of threads to use when training the model using parallel processing.
    • fitParameters

      public static void fitParameters(OxidationStateData data, LikelihoodCalculator initCalculator, int maxIterationsPerPass, double regParameter, String paramFileName, int numThreads)
      Fits the model parameters for all ions contained in the training data, randomly initializing the parameters to values between zero and 1
      Parameters:
      data - The training data
      initCalculator - The Likelihood calculator containing an initial version of the model
      maxIterationsPerPass - After this many optimization steps have been performed, the code will write the current parameters to a file and restart optimization from those parameters.
      regParameter - The regularization parameter to determine how much weight to give the regularization term.
      paramFileName - The name of the output file containing the parameters. The file name should start with "parameters". A separate file containing the boundary values will also be written, where the name of the boundary file will be the same as the name of the parameters file, with the word "parameters" replaced by the word "boundaries".
      numThreads - The number of threads to use when training the model using parallel processing.
    • prepareDataFromICSD

      public static void prepareDataFromICSD(_global.tri.oxidationstates.Main.DataType outDataType)
      Reads a directory of CIF files exported from the ICSD and builds an initial training data file for all ordered, charge-balanced structures. The CIF files should be in ICSD_CIFS/All CIFs/. POSCAR-formatted structures will be written to the Structures directory.
      Parameters:
      outDataType - The dataType for the generated training data
    • findIons

      public static void findIons(_global.tri.oxidationstates.Main.DataType dataType)
      Looks through the training set for the given data type to extract polyatomic ions, as determined by the OxideIonFinder, ZintlIonFinder, or CompositionIonFinder
      Parameters:
      dataType - The data set to search through.
    • groupIons

      public static void groupIons(_global.tri.oxidationstates.Main.DataType dataType, int minOccurrences, int numSamples)
      Places ions of the same composition into groups of structurally similar ions, where the atomic oxidation states of all ions in the set need to match.
      Parameters:
      dataType - The data set for which we are grouping the ions
      minOccurrences - The method will ignore any compositions for which the number of found ions is less than this value. This is useful for screening out rare, very large ions that take a long time to group.
      numSamples - The number of structures to sample when calculating the mean structure for the group
    • groupIonsByOxidationState

      public static void groupIonsByOxidationState(double minAllowedFraction, boolean removeZeroes, int minAllowedOccurrences)
      Place the mean structures found by groupIons(_global.tri.oxidationstates.Main.DataType,int,int) in groups by total oxidation state, calculated by adding the oxidation states of all atoms in the ion. The atomic oxidation states do not need to match each other for ions to be placed in the same group. For each group, selects a representative structure that is structurally similar to all other ions in the group.
      Parameters:
      minAllowedFraction - A representative structure will only be generated if the ratio of the number of structurally similar ions (ignoring total oxidation states) to the total number of ions with the same composition is at least this.
      removeZeroes - True if any ions with zero oxidation state should be removed
      minAllowedOccurrences - A representative structure will only be generated if the ratio of the number of structurally similar ions (ignoring total oxidation states) to the total number of ions with the same composition is greater than this.
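The idea of grouping by total oxidation state, where the total is the sum of the atomic oxidation states, can be sketched as follows. The class `TotalOxidationStateSketch` is hypothetical, each ion is reduced to an array of integer atomic oxidation states (an assumption; the real code works with ion structures), and the structural-similarity and threshold checks are omitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: group ions by the sum of their atomic oxidation states.
class TotalOxidationStateSketch {
    // The total oxidation state is the sum over all atoms in the ion.
    static int totalState(int[] atomicStates) {
        int total = 0;
        for (int state : atomicStates) {
            total += state;
        }
        return total;
    }

    // Ions with different atomic states land in the same group if their totals match.
    static Map<Integer, List<int[]>> groupByTotal(List<int[]> ions) {
        Map<Integer, List<int[]>> groups = new TreeMap<>();
        for (int[] ion : ions) {
            groups.computeIfAbsent(totalState(ion), k -> new ArrayList<>()).add(ion);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<int[]> ions = new ArrayList<>();
        ions.add(new int[] {6, -2, -2, -2, -2});  // sulfate-like: S(+6), four O(-2)
        ions.add(new int[] {5, -1, -2, -2, -2});  // made-up states with the same total
        ions.add(new int[] {-3, 1, 1, 1, 1});     // ammonium-like: N(-3), four H(+1)
        for (Map.Entry<Integer, List<int[]>> group : groupByTotal(ions).entrySet()) {
            System.out.println(group.getKey() + " -> " + group.getValue().size() + " ion(s)");
        }
    }
}
```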
    • getInitialTrainingData

      public static void getInitialTrainingData()
      Starting with a directory of CIF files extracted from the ICSD, select all charge-balanced structures and use them to construct the initial training data file.
    • getEnergiesAboveHull

      public static Map<String,Double> getEnergiesAboveHull()
      Reads energies above the convex hull from a file extracted from the Materials Project and creates a Map (dictionary, for you Python folks) keyed by the icsd_id and with the energy above the hull as the value
      Returns:
      A Map keyed by the icsd_id and with the energy above the hull as the value.
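A minimal sketch of building such a map is shown here. The input format is not documented in this page, so the comma-separated `icsd_id,energy` layout, the `#` comment convention, and the class name `HullEnergyMapSketch` are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: build a Map from icsd_id to energy above the hull.
class HullEnergyMapSketch {
    // Assumed line format: "<icsd_id>,<energy above hull>"; '#' lines are skipped.
    static Map<String, Double> parse(List<String> lines) {
        Map<String, Double> energies = new HashMap<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            String[] fields = line.split(",");
            if (fields.length < 2) {
                continue;  // Ignore lines that do not carry both columns.
            }
            energies.put(fields[0].trim(), Double.parseDouble(fields[1].trim()));
        }
        return energies;
    }

    public static void main(String[] args) {
        // Made-up identifiers and energies, for illustration only.
        Map<String, Double> energies = parse(Arrays.asList(
                "# icsd_id,e_above_hull", "12345,0.0", "67890,0.032"));
        System.out.println(energies.get("67890"));
    }
}
```

In practice the lines could come from `java.nio.file.Files.readAllLines`, which keeps the parsing logic independent of where the file lives.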
    • addWebIonsToData

      public static void addWebIonsToData(_global.tri.oxidationstates.Main.DataType inDataType, _global.tri.oxidationstates.Main.DataType outDataType)
      The "web ions" are the polyatomic ions used in the manuscript and web site. This method identifies all entries in the "inDataType" data set that contain these ions, creates a new version of that entry that contains the composition written in terms of polyatomic ions, and adds the new entry to the data set. The new data set is written to the file corresponding to "outDataType". TODO this whole method should probably be added to the OxidationStateData class.
      Parameters:
      inDataType - The data set to which entries with polyatomic ions should be added
      outDataType - The name of the new data set
    • removeRareIons

      public static void removeRareIons(_global.tri.oxidationstates.Main.DataType inDataType, _global.tri.oxidationstates.Main.DataType outDataType)
      Removes all entries with rare ions from a data set, where an ion is considered "rare" if it appears in fewer than 25 entries. The ion removal process is run repeatedly until no more ions are removed.
      Parameters:
      inDataType - The data set from which ions should be removed
      outDataType - The new data set to be written.
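The repeated removal matters because dropping an entry lowers the counts of the other ions it contains, which can make a previously common ion rare. A minimal sketch of this fixed-point loop follows; the class `RareIonFilterSketch` is hypothetical and each entry is reduced to the set of ion labels it contains (an assumption).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: remove entries with rare ions until nothing changes.
class RareIonFilterSketch {
    static List<Set<String>> removeRare(List<Set<String>> entries, int minEntries) {
        List<Set<String>> current = new ArrayList<>(entries);
        while (true) {
            // Count the number of entries in which each ion appears.
            Map<String, Integer> counts = new HashMap<>();
            for (Set<String> entry : current) {
                for (String ion : entry) {
                    counts.merge(ion, 1, Integer::sum);
                }
            }
            // Keep only entries in which every ion is common enough.
            List<Set<String>> kept = new ArrayList<>();
            for (Set<String> entry : current) {
                boolean hasRareIon = false;
                for (String ion : entry) {
                    if (counts.get(ion) < minEntries) {
                        hasRareIon = true;
                        break;
                    }
                }
                if (!hasRareIon) {
                    kept.add(entry);
                }
            }
            if (kept.size() == current.size()) {
                return kept;  // No entry was removed in this pass, so we are done.
            }
            current = kept;
        }
    }

    public static void main(String[] args) {
        List<Set<String>> entries = new ArrayList<>();
        entries.add(new HashSet<>(Arrays.asList("A", "B")));
        entries.add(new HashSet<>(Arrays.asList("A", "B")));
        entries.add(new HashSet<>(Arrays.asList("A", "C")));
        entries.add(new HashSet<>(Arrays.asList("C", "D")));
        // A threshold of 2 is used here for brevity; the documented threshold is 25.
        System.out.println(removeRare(entries, 2).size());
    }
}
```

In the small example, removing the entry with the rare ion "D" makes "C" rare on the next pass, so a single pass would not be enough.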
    • getValidationAssignments

      public static void getValidationAssignments(_global.tri.oxidationstates.Main.DataType dataType, boolean frequency)
      Assigns oxidation states calculated using 10-fold cross validation to all of the structures in the given dataset and writes out corresponding CIF files to a directory. The description of each CIF file will be the GII for that assignment, and oxidation states will be assigned to sites in a way that minimizes the GII.
      Parameters:
      dataType - The data set to be used. Fitted parameters will be read for this data set when calculating the Likelihood Score.
      frequency - True if the Frequency Score should be used to assign oxidation states, false if the Likelihood score should be used.
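For readers unfamiliar with the GII (global instability index), the commonly used definition is the root-mean-square deviation between each site's bond valence sum and its assigned oxidation state. The sketch below computes that quantity. It assumes the bond valence sums are already available as inputs and carry the same sign convention as the oxidation states; the class `GiiSketch` is hypothetical and does not show how this code base computes bond valences or searches over site assignments.

```java
// Hypothetical sketch of the standard GII formula:
// GII = sqrt( (1/N) * sum_i (bondValenceSum_i - oxidationState_i)^2 )
class GiiSketch {
    static double gii(double[] bondValenceSums, double[] oxidationStates) {
        if (bondValenceSums.length != oxidationStates.length || bondValenceSums.length == 0) {
            throw new IllegalArgumentException("Need one bond valence sum per site");
        }
        double sumOfSquares = 0.0;
        for (int i = 0; i < bondValenceSums.length; i++) {
            double deviation = bondValenceSums[i] - oxidationStates[i];
            sumOfSquares += deviation * deviation;
        }
        return Math.sqrt(sumOfSquares / bondValenceSums.length);
    }

    public static void main(String[] args) {
        // Made-up bond valence sums for two sites assigned states +2 and +4.
        System.out.println(gii(new double[] {2.3, 3.6}, new double[] {2.0, 4.0}));
    }
}
```

A lower value indicates that the assigned oxidation states agree more closely with the local bonding environment, which is why assignments to sites are chosen to minimize it.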
    • getTrainingAssigments

      public static void getTrainingAssigments(_global.tri.oxidationstates.Main.DataType dataType, boolean frequency)
      Assigns oxidation states to all of the structures in the given dataset using the model trained on that dataset and writes out corresponding CIF files to a directory. The description of each CIF file will be the GII for that assignment, and oxidation states will be assigned to sites in a way that minimizes the GII.
      Parameters:
      dataType - The data set to be used. Fitted parameters will be read for this data set when calculating the Likelihood Score.
      frequency - True if the Frequency Score should be used to assign oxidation states, false if the Likelihood score should be used.
    • getICSDAssignments

      public static void getICSDAssignments(_global.tri.oxidationstates.Main.DataType dataType, boolean reassign)
      Assigns oxidation states to all of the structures in the given dataset using oxidation states provided in the ICSD. The description of each CIF file will be the GII for that assignment.
      Parameters:
      dataType - The data set to be used. Fitted parameters will be read for this data set when calculating the Likelihood Score.
      reassign - True if the oxidation states should be reassigned to sites in a way that minimizes the GII, which is usually (but not always) how they are assigned in the ICSD. False if we should just use the ICSD assignments.
    • assignBertosStates

      public static void assignBertosStates()
      Reads the states calculated by BERTOS based on composition and assigns them to sites in a way that minimizes the GII. The assigned states are used to calculate the GII
    • stateSetFromBertosString

      public static OxidationStateSet stateSetFromBertosString(String bertosString)
      Reads an oxidation state prediction as output by BERTOS and converts it into an OxidationStateSet for use with this code.
      Parameters:
      bertosString - A set of oxidation states for a given composition as calculated using BERTOS
      Returns:
      An OxidationStateSet representing the assignment in the bertosString.
    • calcGIIForPymatgenStructs

      public static void calcGIIForPymatgenStructs()
      Reads the states calculated by PyMatGen and assigns them to sites in a way that minimizes the GII. The assigned states are used to calculate the GII
    • cleanDataGII

      public static void cleanDataGII(_global.tri.oxidationstates.Main.DataType inType, _global.tri.oxidationstates.Main.DataType outType)
      Looks through all entries of the data set given by inType and determines whether the GII is lower than the GII for the ICSD (calculated by assigning ICSD states to sites in a way that minimizes the GII). A new data set, outType, is created with such entries removed. Rare ions are also removed. To facilitate calculations and analysis, additional files are written listing the "remainder" entries (those that were removed) based on GII and rare ions. Test-training splits for leave-10-out cross-validation are also generated.
      Parameters:
      inType - The data set that we are cleaning
      outType - The cleaned data set
    • splitData

      public static void splitData(OxidationStateData data, int numSplits, String outDirName)
      Randomly splits the data for leave-k-out cross validation in a way that ensures that no composition appears in both the test and training set for any split.
      Parameters:
      data - The data set to be split
      numSplits - The number of splits to generate (What "k" is in leave-k-out cross validation).
      outDirName - Where to write the splits being generated
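One way to guarantee that no composition appears in both the test and training set is to assign whole compositions, rather than individual entries, to test sets. The sketch below illustrates that idea and is not the actual implementation: the class `CompositionSplitSketch`, the use of plain strings for compositions, and the explicit random seed are all assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical sketch: composition-aware random splitting for cross validation.
class CompositionSplitSketch {
    // Returns, for each split, the indices of the entries held out as its test set.
    static List<List<Integer>> split(List<String> compositions, int numSplits, long seed) {
        // Collect the entry indices that share each composition.
        Map<String, List<Integer>> byComposition = new LinkedHashMap<>();
        for (int i = 0; i < compositions.size(); i++) {
            byComposition.computeIfAbsent(compositions.get(i), k -> new ArrayList<>()).add(i);
        }
        // Shuffle the distinct compositions, then deal them out round-robin.
        List<String> keys = new ArrayList<>(byComposition.keySet());
        Collections.shuffle(keys, new Random(seed));
        List<List<Integer>> testSets = new ArrayList<>();
        for (int s = 0; s < numSplits; s++) {
            testSets.add(new ArrayList<>());
        }
        for (int k = 0; k < keys.size(); k++) {
            testSets.get(k % numSplits).addAll(byComposition.get(keys.get(k)));
        }
        return testSets;
    }

    public static void main(String[] args) {
        List<String> compositions = Arrays.asList(
                "NaCl", "Fe2O3", "NaCl", "MgO", "Fe2O3", "TiO2", "NaCl");
        System.out.println(split(compositions, 3, 42L));
    }
}
```

For each split, the training set is every entry not listed in that split's test set, so entries sharing a composition always stay on the same side.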
    • compareAssignments

      public static void compareAssignments(_global.tri.oxidationstates.Main.DataType dataType)
      This method prints out summary statistics comparing sets of oxidation states assignments (e.g. from different methods for predicting oxidation states) for a given data set.
      Parameters:
      dataType - The data set for which we are comparing assignments.
    • getCAMDAssignments

      public static void getCAMDAssignments(_global.tri.oxidationstates.Main.DataType dataType, boolean noPoly)
      Assign oxidation states to the data set used in the CAMD search for new materials.
      Parameters:
      dataType - The data set that we will use to generate the assignments (we will use parameters fit to this data).
      noPoly - True if we should write the CAMD compositions in terms of polyatomic ions, false otherwise.
    • getCAMDDiscoveryCurve

      public static void getCAMDDiscoveryCurve(_global.tri.oxidationstates.Main.DataType dataType)
      This prints the data showing the percent of compositions with stable structures found vs the percent of total compositions evaluated using the CAMD data sorted by Likelihood score.
      Parameters:
      dataType - The data set for which analysis should be performed. (The oxidation analyzer was trained using this data set.)
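The shape of such a discovery curve can be sketched as follows. The class `DiscoveryCurveSketch` and its inputs are hypothetical: each composition is reduced to a likelihood score and a flag saying whether a stable structure was found, and compositions are assumed to be evaluated from highest to lowest score.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: percent of stable compositions found vs percent evaluated.
class DiscoveryCurveSketch {
    // Returns one point per evaluated composition: {percentEvaluated, percentFound}.
    static List<double[]> curve(double[] scores, boolean[] stable) {
        int n = scores.length;
        List<Integer> order = new ArrayList<>();
        int totalStable = 0;
        for (int i = 0; i < n; i++) {
            order.add(i);
            if (stable[i]) {
                totalStable++;
            }
        }
        // Assumption: evaluate compositions from highest to lowest score.
        order.sort(Comparator.comparingDouble((Integer i) -> scores[i]).reversed());
        List<double[]> points = new ArrayList<>();
        int found = 0;
        for (int rank = 0; rank < n; rank++) {
            if (stable[order.get(rank)]) {
                found++;
            }
            double percentEvaluated = 100.0 * (rank + 1) / n;
            double percentFound = totalStable == 0 ? 0.0 : 100.0 * found / totalStable;
            points.add(new double[] {percentEvaluated, percentFound});
        }
        return points;
    }

    public static void main(String[] args) {
        // Made-up scores and stability flags, for illustration only.
        double[] scores = {0.9, 0.1, 0.5, 0.7};
        boolean[] stable = {true, false, false, true};
        for (double[] point : curve(scores, stable)) {
            System.out.println(point[0] + "% evaluated, " + point[1] + "% found");
        }
    }
}
```

A curve that rises well above the diagonal indicates that sorting by score surfaces stable compositions earlier than random ordering would.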
    • getCAMDHullHistogram

      public static void getCAMDHullHistogram(_global.tri.oxidationstates.Main.DataType dataType)
      Get the histogram of what percentage of structures are on the hull as a function of likelihood score for the CAMD data.
      Parameters:
      dataType - The data set used to parameterize the oxidation predictions.
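A histogram of this kind can be sketched by binning structures by score and reporting the percentage on the hull in each bin. The class `HullHistogramSketch` is hypothetical, and the assumptions that scores lie between 0 and 1 and that bins have equal width are made only for illustration.

```java
// Hypothetical sketch: percentage of structures on the hull per score bin.
class HullHistogramSketch {
    // Assumes scores in [0, 1]; empty bins are reported as NaN.
    static double[] percentOnHull(double[] scores, boolean[] onHull, int numBins) {
        int[] counts = new int[numBins];
        int[] hullCounts = new int[numBins];
        for (int i = 0; i < scores.length; i++) {
            // Clamp so that a score of exactly 1.0 falls in the last bin.
            int bin = Math.min((int) (scores[i] * numBins), numBins - 1);
            bin = Math.max(bin, 0);
            counts[bin]++;
            if (onHull[i]) {
                hullCounts[bin]++;
            }
        }
        double[] percent = new double[numBins];
        for (int b = 0; b < numBins; b++) {
            percent[b] = counts[b] == 0 ? Double.NaN : 100.0 * hullCounts[b] / counts[b];
        }
        return percent;
    }

    public static void main(String[] args) {
        // Made-up scores and hull flags, for illustration only.
        double[] scores = {0.05, 0.15, 0.95, 0.99};
        boolean[] onHull = {false, true, true, true};
        double[] percent = percentOnHull(scores, onHull, 3);
        for (int b = 0; b < percent.length; b++) {
            System.out.println("bin " + b + ": " + percent[b]);
        }
    }
}
```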
    • writeBoundaryJSON

      public static void writeBoundaryJSON(_global.tri.oxidationstates.Main.DataType dataType)
      Write the JSON file containing the oxidation state boundaries.
      Parameters:
      dataType - The data set for which we will write the boundaries.
    • writeXYZFiles

      public static void writeXYZFiles()
      Write files in xyz format for each of the polyatomic ions used for the web site and paper
    • loadWebIons

      public static void loadWebIons()
      Loads the polyatomic ions used for the web site and paper to the IonFactory. This should usually be called first, as the code frequently checks IonFactory to get the list of known polyatomic ions. It's OK to call this multiple times on the same directory of ions.
    • getWebIonDirName

      public static String getWebIonDirName()
      Returns the directory name for the polyatomic ions used for the paper and web site
      Returns:
      the directory name for the polyatomic ions used for the paper and web site
    • printAllKnownStates

      public static void printAllKnownStates(_global.tri.oxidationstates.Main.DataType dataType)
      Write all of the distinct oxidation states in a given data set
      Parameters:
      dataType - The data set for which we should print out the oxidation states
    • printElectrochemicalSeries

      public static void printElectrochemicalSeries(_global.tri.oxidationstates.Main.DataType dataType)
      Generate a table for the electrochemical series, consisting of boundaries for all redox pairs in the data set. In general it will not be sorted, so you may want to sort the output.
      Parameters:
      dataType - The data set for which we will print out the electrochemical series.
    • writeDataFilesForVisualization

      public static void writeDataFilesForVisualization(_global.tri.oxidationstates.Main.DataType dataType, boolean smoothCutoff)
      Generate the files we use to generate the wavy bar plots (or equivalent straight line plots).
      Parameters:
      dataType - The data set for which we are generating the plots.
      smoothCutoff - True if the plots should show the logistic function cutoff, and false if they should just show the mean boundary values.
    • getData

      public static OxidationStateData getData(_global.tri.oxidationstates.Main.DataType dataType)
      Return the data set for the given data type.
      Parameters:
      dataType - The type of data set we want
      Returns:
      The data set containing entries with compositions and oxidation states.
    • getDataFileName

      public static String getDataFileName(_global.tri.oxidationstates.Main.DataType dataType)
      Return the file name for the given data set
      Parameters:
      dataType - The data set we are interested in
      Returns:
      The file name
    • getIonStructureDirName

      public static String getIonStructureDirName(_global.tri.oxidationstates.Main.DataType dataType)
      Return the polyatomic ion structure directory name for the given data set
      Parameters:
      dataType - The data set we are interested in
      Returns:
      The directory name
    • getPolyatomicIonDirName

      public static String getPolyatomicIonDirName(_global.tri.oxidationstates.Main.DataType dataType)
      Return the name of directory of found polyatomic ions for the given data set
      Parameters:
      dataType - The data set we are interested in
      Returns:
      The directory name
    • getLikelihoodCalculator

      public static LikelihoodCalculator getLikelihoodCalculator(_global.tri.oxidationstates.Main.DataType dataType)
      Return a likelihood calculator for the given data type.
      Parameters:
      dataType - The data set we are interested in.
      Returns:
      A likelihood calculator using parameters trained on the given data set.
    • getParamFileName

      public static String getParamFileName(_global.tri.oxidationstates.Main.DataType dataType)
      Returns the name of the fitted parameter file for the given data type. This method assumes a regularization parameter of 5E-6.
      Parameters:
      dataType - The given data type
      Returns:
      the name of the fitted parameter file for the given data type
    • testWebAPI

      public static void testWebAPI()
      This method provides an example of how to call the API to replicate the table-generating functionality of the web app.