Class OxidationStateData

java.lang.Object
_global.tri.oxidationstates.fitting.OxidationStateData

public class OxidationStateData extends Object
This is the main class for data sets (e.g. testing, training data).
  • Constructor Details

    • OxidationStateData

      public OxidationStateData(Collection<OxidationStateData.Entry> entries, String structDir)
      Create a data set with the provided entries
      Parameters:
      entries - The entries to include in this data set
      structDir - A directory that contains a structure file for each of the entries, in VASP POSCAR format
    • OxidationStateData

      public OxidationStateData(String fileName, String structDir)
      Read a data set from a given file
      Parameters:
      fileName - The name of the given file
      structDir - A directory that contains a structure file for each of the entries, in VASP POSCAR format
    • OxidationStateData

      public OxidationStateData(String fileName, boolean removeNonInteger, double hullCutoff, boolean removeZeroOxidation, boolean removeZintl, String structDir)
      Read a data set from a given file and remove entries according to the given options
      Parameters:
      fileName - The name of the given file
      removeNonInteger - Remove all entries that contain oxidation states with non-integer values
      hullCutoff - An energy in eV / atom. All entries whose energy above the convex hull exceeds this value will be removed.
      removeZeroOxidation - Remove entries with oxidation states of zero
      removeZintl - Remove entries that contain Zintl ions, as determined by the ZintlIonFinder
      structDir - A directory that contains a structure file for each of the entries, in VASP POSCAR format
  • Method Details

    • writeFile

      public void writeFile(String fileName)
      Writes a file containing this data set
      Parameters:
      fileName - The name of the file to be written
    • writeFile

      public void writeFile(Writer writer) throws IOException
      Writes a file containing this data set
      Parameters:
      writer - The file will be written to this writer
      Throws:
      IOException - if there is an I/O error
    • removeRandomEntries

      public void removeRandomEntries(double percentToRemove)
      Removes a random subset of this data set
      Parameters:
      percentToRemove - The percent of entries to remove (rounded off).
    • copy

      public OxidationStateData copy()
      Returns a copy of this data set
      Returns:
      a copy of this data set
    • removeEntries

      public void removeEntries(OxidationStateData entriesToRemove)
      Removes the entries in the given set. Note that the entry objects need to be exactly the same; i.e. both this data set and the "entriesToRemove" data set should be derived from the same data set.
      Parameters:
      entriesToRemove - A data set containing the entries to be removed. Note that the entry objects need to be exactly the same; i.e. both this data set and the "entriesToRemove" data set should be derived from the same data set.
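      The requirement that the entry objects be "exactly the same" reads like a reference-identity comparison rather than an equals() comparison. The following is a minimal sketch of that idea only, assuming identity semantics; IdentityRemovalSketch and removeByIdentity are hypothetical names and are not part of OxidationStateData.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of the documented API.
class IdentityRemovalSketch {

    /** Removes from data every element that is the same object (==) as one in toRemove. */
    static <T> void removeByIdentity(List<T> data, Collection<? extends T> toRemove) {
        // A set backed by IdentityHashMap compares members with == instead of equals().
        Set<T> identitySet = Collections.newSetFromMap(new IdentityHashMap<T, Boolean>());
        identitySet.addAll(toRemove);
        data.removeIf(identitySet::contains);
    }
}
```

      Under this reading, an entry that is equal in content but was created separately (for example by re-reading the same file) would not be removed.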
    • dataKeepOnlyIons

      public void dataKeepOnlyIons(Set<IonFactory.Ion> allowedIons)
      Only keep entries for which all ions are in the given set
      Parameters:
      allowedIons - Entries will only be kept if all ions in the entry are in this set.
    • splitData

      public OxidationStateData[] splitData(int numSplits)
      Randomly split the data into numSplits test sets. The union of all of the test sets will be this complete data set, and all test sets will be approximately the same size. The split is done so that no composition appears in more than one test set, so the same composition never occurs in both a test set and its training set.
      Parameters:
      numSplits - The number of test sets to generate
      Returns:
      An array of generated test sets
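      A split in which no composition crosses test sets can be sketched as a grouped split: entries are grouped by composition, the groups are shuffled, and each group is assigned whole to one split. This is an illustration under assumptions, not the implementation behind splitData; the Item class, the seed parameter, and the "smallest split first" balancing rule are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical illustration of a composition-grouped random split.
class GroupedSplitSketch {

    /** Hypothetical stand-in for an entry: an ID plus a composition string. */
    static final class Item {
        final String id;
        final String composition;

        Item(String id, String composition) {
            this.id = id;
            this.composition = composition;
        }
    }

    static List<List<Item>> split(List<Item> items, int numSplits, long seed) {
        // Group by composition so that each composition moves as one unit.
        Map<String, List<Item>> groups = new LinkedHashMap<>();
        for (Item item : items) {
            groups.computeIfAbsent(item.composition, k -> new ArrayList<>()).add(item);
        }
        List<List<Item>> shuffled = new ArrayList<>(groups.values());
        Collections.shuffle(shuffled, new Random(seed));

        List<List<Item>> splits = new ArrayList<>();
        for (int i = 0; i < numSplits; i++) {
            splits.add(new ArrayList<>());
        }
        // Assign each group to the currently smallest split to keep sizes similar.
        for (List<Item> group : shuffled) {
            List<Item> smallest = splits.get(0);
            for (List<Item> candidate : splits) {
                if (candidate.size() < smallest.size()) {
                    smallest = candidate;
                }
            }
            smallest.addAll(group);
        }
        return splits;
    }
}
```

      For cross-validation, the training set for a given test set would then be the union of the other splits.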
    • getStructDir

      public String getStructDir()
      Returns the directory with atomic structure files for the entries
      Returns:
      the directory with atomic structure files for the entries
    • getCountsByIon

      public HashMap<IonFactory.Ion,Integer> getCountsByIon()
      Returns a map in which the keys are the ions contained in this data set and the values are the number of entries that contain the corresponding ion.
      Returns:
      a map in which the keys are the ions contained in this data set and the values are the number of entries that contain the corresponding ion.
    • removeUncommonIonsByCount

      public void removeUncommonIonsByCount(int minAllowedCount)
      Removes all entries that contain rare ions, where "rare" ions are those that appear in fewer than minAllowedCount entries
      Parameters:
      minAllowedCount - The minimum number of entries an ion must appear in to not be considered rare.
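      The count-based filter can be sketched in two steps: count the entries containing each ion, then drop every entry that contains an ion below the threshold. This is a minimal single-pass sketch; entries are modeled as sets of ion labels, the class and method names are hypothetical, and whether the real method recounts after removal is not stated in this documentation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration; entries are modeled as sets of ion labels such as "Na+".
class RareIonCountSketch {

    /** Counts, for each ion label, the number of entries that contain it. */
    static Map<String, Integer> countsByIon(List<Set<String>> entries) {
        Map<String, Integer> counts = new HashMap<>();
        for (Set<String> ions : entries) {
            for (String ion : ions) {
                counts.merge(ion, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Returns the entries in which every ion appears in at least minAllowedCount entries. */
    static List<Set<String>> removeRare(List<Set<String>> entries, int minAllowedCount) {
        Map<String, Integer> counts = countsByIon(entries);
        List<Set<String>> kept = new ArrayList<>();
        for (Set<String> ions : entries) {
            boolean hasRareIon = false;
            for (String ion : ions) {
                if (counts.get(ion) < minAllowedCount) {
                    hasRareIon = true;
                    break;
                }
            }
            if (!hasRareIon) {
                kept.add(ions);
            }
        }
        return kept;
    }
}
```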
    • printNumEntriesByIon

      public void printNumEntriesByIon()
      Prints to standard output the number of entries containing each ion in this data set.
    • removeUncommonOxidationStates

      public void removeUncommonOxidationStates(double minAllowedFraction)
      Removes entries containing rare ions, where an ion is rare if the fraction of entries it appears in for its ion type is less than minAllowedFraction
      Parameters:
      minAllowedFraction - An ion will be considered rare if the fraction of entries it appears in for its ion type is less than this value. For example, if A2+ appears in 10 entries and A3+ appears in 90 entries, then the fraction for A2+ is 0.1, and all entries containing A2+ will be removed if minAllowedFraction is greater than 0.1.
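      The rarity test can be written out as a small calculation. The sketch below takes the entry counts for each oxidation state of a single ion type and applies the "fraction less than minAllowedFraction" rule; the class and method names are hypothetical, and using the sum of the per-state counts as the denominator is an assumption about how the fraction is defined.

```java
import java.util.Map;

// Hypothetical illustration of the fraction-based rarity test for one ion type.
class OxidationFractionSketch {

    /** Fraction of the ion type's entry counts that belong to the given oxidation state. */
    static double fraction(Map<Integer, Integer> countsByState, int state) {
        int total = 0;
        for (int count : countsByState.values()) {
            total += count;
        }
        if (total == 0) {
            return 0.0;
        }
        return countsByState.getOrDefault(state, 0) / (double) total;
    }

    /** An oxidation state is rare if its fraction is strictly less than minAllowedFraction. */
    static boolean isRare(Map<Integer, Integer> countsByState, int state, double minAllowedFraction) {
        return fraction(countsByState, state) < minAllowedFraction;
    }
}
```

      With counts of 10 for the 2+ state and 90 for the 3+ state, fraction returns 0.1 for the 2+ state, so isRare is true for a threshold of 0.2 and false for a threshold of 0.05.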
    • getUniqueEntries

      public HashMap<String,OxidationStateData.Entry> getUniqueEntries(boolean keepPolyIons)
      When entries with compositions written in terms of polyatomic ions are added to the data set, there will be two entries with the same ID: one with a composition written in terms of monatomic ions, and one with a composition written in terms of polyatomic ions. This method filters out one of the two entries with the same ID. It does not change the data set, but returns a map of the remaining entries keyed by entry ID.
      Parameters:
      keepPolyIons - If true, remove the entries with duplicate ID that have monatomic ions. If false, remove the entries with duplicate ID that have polyatomic ions.
      Returns:
      A map of the remaining entries keyed by entry ID.
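      The de-duplication can be pictured as building a map by ID in which, when two entries share an ID, the one in the preferred form wins. This is a hedged sketch only: the Item class and its polyatomic flag are hypothetical stand-ins, and the real method may decide which entry is polyatomic in a different way.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of keeping one entry per ID.
class UniqueEntriesSketch {

    /** Hypothetical stand-in for an entry. */
    static final class Item {
        final String id;
        final String composition;
        final boolean polyatomic;

        Item(String id, String composition, boolean polyatomic) {
            this.id = id;
            this.composition = composition;
            this.polyatomic = polyatomic;
        }
    }

    /** Returns one item per ID; on a duplicate ID, the item in the preferred form is kept. */
    static Map<String, Item> unique(List<Item> items, boolean keepPolyIons) {
        Map<String, Item> byId = new HashMap<>();
        for (Item item : items) {
            Item existing = byId.get(item.id);
            // Store the item if the ID is new, or if it is in the preferred form.
            if (existing == null || item.polyatomic == keepPolyIons) {
                byId.put(item.id, item);
            }
        }
        return byId; // the input list is left unchanged
    }
}
```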
    • removeEntriesWithZeroOxidationStates

      public void removeEntriesWithZeroOxidationStates()
      Removes all entries for which at least one of the ions has an oxidation state of zero.
    • removeGIIDecrease

      public void removeGIIDecrease(String structDirectory, String refStructDirectory, LikelihoodCalculator calculator)
      Removes all entries for which the GII in structDirectory is less than the GII in refStructDirectory. A tolerance of 1E-6 is used when comparing GIIs. TODO re-write this method so that it just reads the GII from the entry (that field wasn't there when this was written).
      Parameters:
      structDirectory - A directory containing structures, where the description field gives the GII.
      refStructDirectory - A directory containing structures, where the description field gives the GII.
      calculator - A likelihood calculator used for logging purposes (tracking the likelihood score of the removed entries).
    • removeNonChargeBalancedStructures

      public void removeNonChargeBalancedStructures(String structDirName)
      Removes all entries that do not have charge neutral structures, defined as structures for which all of the oxidation states of the atoms in each unit cell add up to zero.
      Parameters:
      structDirName - The name of the directory containing the structure files in VASP POSCAR format.
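      The charge-neutrality condition amounts to checking that the oxidation states, weighted by the number of atoms of each ion in the unit cell, sum to zero. The sketch below shows that check in isolation; the class and method names are hypothetical, and the explicit tolerance parameter is an assumption, since the documentation does not say how floating-point sums are compared.

```java
// Hypothetical illustration of a charge-balance check for one unit cell.
class ChargeBalanceSketch {

    /**
     * oxidationStates[i] is the oxidation state of ion i and counts[i] is the
     * number of atoms of that ion in the unit cell.
     */
    static boolean isChargeBalanced(double[] oxidationStates, int[] counts, double tolerance) {
        if (oxidationStates.length != counts.length) {
            throw new IllegalArgumentException("Arrays must have the same length");
        }
        double netCharge = 0.0;
        for (int i = 0; i < counts.length; i++) {
            netCharge += oxidationStates[i] * counts[i];
        }
        return Math.abs(netCharge) <= tolerance;
    }
}
```

      For Fe2O3 with Fe3+ and O2-, the sum is 2(+3) + 3(-2) = 0, so the check passes; with Fe2+ it would give a net charge of -2 and fail.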
    • removeEntriesWithZintlIons

      public void removeEntriesWithZintlIons()
      Removes all entries with Zintl ions, as determined by the ZintlIonFinder.
    • getKnownOxidationStates

      public HashMap<String,int[]> getKnownOxidationStates()
      Returns a map of oxidation states for each ion type in this data set. The map is keyed by the ion type ID and the values are the oxidation states, in ascending order.
      Returns:
      a map of oxidation states for each ion type in this data set. The map is keyed by the ion type ID and the values are the oxidation states, in ascending order.
    • removeEntriesWithNonIntegerStates

      public void removeEntriesWithNonIntegerStates()
      Removes all entries that contain oxidation states that are not within 0.01 of an integer.
    • removeEntriesWithNonIntegerStates

      public void removeEntriesWithNonIntegerStates(double tolerance)
      Removes all entries that contain oxidation states that are not within "tolerance" of an integer.
      Parameters:
      tolerance - The maximum allowed difference between the oxidation state and an integer to be considered an integer oxidation state.
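      The tolerance test is a comparison between an oxidation state and its nearest integer. A minimal sketch, assuming the boundary case (a difference exactly equal to the tolerance) counts as an integer state; the class and method names are hypothetical.

```java
// Hypothetical illustration of the near-integer test.
class IntegerStateSketch {

    /** True if the oxidation state is within tolerance of its nearest integer. */
    static boolean isNearInteger(double oxidationState, double tolerance) {
        double nearest = Math.rint(oxidationState); // nearest integer value as a double
        return Math.abs(oxidationState - nearest) <= tolerance;
    }
}
```

      With the default tolerance of 0.01, a state of 2.005 would be treated as the integer 2, while a state of 2.5 or 2.67 would cause the entry to be removed.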
    • removeStructuresNotNearHull

      public void removeStructuresNotNearHull(double energyAboveHull)
      Removes all structures with energy above hull greater than the provided value. If the energy above the hull is not defined, the entry is removed.
      Parameters:
      energyAboveHull - The maximum allowed energy above the hull, in eV / atom.
    • removeUnstableEntries

      public void removeUnstableEntries(double maxEnergyAboveHull)
      Removes all structures with energy above hull greater than the provided value. If the energy above the hull is not defined, the entry is not removed.
      Parameters:
      maxEnergyAboveHull - The maximum allowed energy above the hull, in eV / atom.
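      removeStructuresNotNearHull and removeUnstableEntries differ only in how they treat an undefined energy above the hull. Since addEntry uses Double.NaN for an unknown energy, the difference can be sketched as two keep-predicates; the class and method names here are hypothetical and this is not the library's code.

```java
// Hypothetical illustration of the two policies for undefined (NaN) hull energies.
class HullFilterSketch {

    /** Policy described for removeStructuresNotNearHull: an undefined energy is removed. */
    static boolean keepStrict(double energyAboveHull, double cutoff) {
        // Any comparison with NaN is false, so an undefined energy is not kept.
        return energyAboveHull <= cutoff;
    }

    /** Policy described for removeUnstableEntries: an undefined energy is kept. */
    static boolean keepLenient(double energyAboveHull, double cutoff) {
        return Double.isNaN(energyAboveHull) || energyAboveHull <= cutoff;
    }
}
```

      The strict policy suits cases where every retained entry must have a known stability; the lenient policy avoids discarding entries only because the hull energy is missing.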
    • getMinIntegerOxidationState

      public int getMinIntegerOxidationState()
      Gets the lowest integer oxidation state in this data set, where any oxidation state within 0.01 of an integer is rounded to that integer.
      Returns:
      the lowest integer oxidation state in this data set, where any oxidation state within 0.01 of an integer is rounded to that integer.
    • getMaxIntegerOxidationState

      public int getMaxIntegerOxidationState()
      Gets the highest integer oxidation state in this data set, where any oxidation state within 0.01 of an integer is rounded to that integer.
      Returns:
      the highest integer oxidation state in this data set, where any oxidation state within 0.01 of an integer is rounded to that integer.
    • numEntries

      public int numEntries()
      The total number of entries in this data set.
      Returns:
      the total number of entries in this data set.
    • getEntry

      public OxidationStateData.Entry getEntry(int entryNum)
      Returns the "entryNum"'th entry in this data set.
      Parameters:
      entryNum - The index of the entry to be returned.
      Returns:
      the "entryNum"'th entry in this data set.
    • addEntry

      public OxidationStateData.Entry addEntry(String structureID, String composition, IonFactory.Ion[] ions, String[] sources, double energyAboveHull, double gii)
      Add an entry to this data set
      Parameters:
      structureID - The ID for this entry. Entries do not need to have unique IDs in the case of monatomic / polyatomic compositions for the same structure, but if non-unique IDs are used in other contexts some functionality might not work as expected.
      composition - The composition for this entry.
      ions - The ions (including oxidation states) in this entry.
      sources - Where this entry came from. Multiple sources are allowed.
      energyAboveHull - The energy above the convex hull, in eV / atom. Double.NaN if unknown.
      gii - The global instability index for this entry. Double.NaN if unknown.
      Returns:
      the entry that was added.
    • setEnergiesAboveHull

      public void setEnergiesAboveHull(Map<String,Double> energiesByID)
      Sets the energies above the hull for entries in the given map. The energies are set for all entries with the given ID, even if multiple entries share the same ID.
      Parameters:
      energiesByID - A map in which the key is an entry ID and the value is the energy above the hull in eV / atom. The energies are set for all entries with the given ID, even if multiple entries share the same ID.
    • writeLikelihoods

      public void writeLikelihoods(LikelihoodCalculator calculator, String fileName)
      Writes the calculated likelihood scores, along with information about the composition and ions, for all entries in this data set to the given file.
      Parameters:
      calculator - The calculator used to calculate the likelihood score.
      fileName - The name of the file to be written.