Digichem Database

The digichem database program is used to manage databases of completed calculation results. It consists of several sub-commands, depending on what you want to do with the database:

$ digichem database FILE SUB_COMMAND

FILE is the database file you want to act upon. It can be a path to a database file:

$ digichem database results.db SUB_COMMAND

Or the name of a database pre-configured in Digichem. You can configure as many databases as you like, but by default only the main database is configured:

$ digichem database main SUB_COMMAND

`digichem database insert`

The insert sub-command is used to add completed calculation results to a database:

$ digichem database results.db insert FILES

Where FILES are completed calculation log files:

$ digichem database results.db insert Benzene.log Pyridine.log

The shell wildcard character * can be used as normal to insert many calculation results at once:

// Insert all calculation log files in the current folder.
$ digichem database results.db insert *.log

For example, to insert all the results from a computational screen submitted with Digichem with the following structure:

Benzene/
    Gaussian 16 Optimisation Frequencies PBE1PBE (GD3BJ) Toluene 6-31G(d,p)/
    Gaussian 16 Excited States TDA 10 Singlets 10 Triplets PBE1PBE (GD3BJ) Toluene 6-31G(d,p)/
Pyridine/
    Gaussian 16 Optimisation Frequencies PBE1PBE (GD3BJ) Toluene 6-31G(d,p)/
    Gaussian 16 Excited States TDA 10 Singlets 10 Triplets PBE1PBE (GD3BJ) Toluene 6-31G(d,p)/
Toluene/
    Gaussian 16 Optimisation Frequencies PBE1PBE (GD3BJ) Toluene 6-31G(d,p)/
    Gaussian 16 Excited States TDA 10 Singlets 10 Triplets PBE1PBE (GD3BJ) Toluene 6-31G(d,p)/

Use the following command:

$ digichem database results.db insert */*
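
The pattern */* is expanded by the shell, not by Digichem, so each matching folder arrives as a separate argument even when its name contains spaces. As a rough check of what would be passed on, the sketch below recreates a cut-down version of the layout in a scratch folder (the shortened folder names are placeholders) and prints one line per argument:

```shell
# Recreate a cut-down version of the layout in a scratch folder.
# The shortened folder names are placeholders, not real Digichem output.
scratch=$(mktemp -d)
mkdir -p "$scratch/Benzene/Gaussian 16 Optimisation Frequencies" \
         "$scratch/Benzene/Gaussian 16 Excited States" \
         "$scratch/Pyridine/Gaussian 16 Optimisation Frequencies" \
         "$scratch/Pyridine/Gaussian 16 Excited States"
cd "$scratch" || exit 1

# Load the expansion of */* into the positional parameters.
set -- */*
matched=$#

# One line per argument: spaces inside a name do not split it.
printf '%s\n' "$@"
```

Here the pattern expands to four arguments, one per calculation folder.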

Digichem will generate a unique ID for each calculation that is inserted into the database, and these will be printed at the end of the command:

$ digichem database results.db insert Benzene.log
4341561273e3e48d80f5427212e785f0d7044835

For a given calculation, this ID is always the same (even if inserting into a different database), and Digichem will not allow an identical calculation to be inserted into the same database multiple times. If you try to do this, Digichem will print a warning:

// Try to insert the same file twice
$ digichem database results.db insert Benzene.log Benzene.log
digichem: WARNING: UserWarning: Not inserting 4341561273e3e48d80f5427212e785f0d7044835, Benzene, Gaussian (2016+C.01), Optimisation, Frequencies, DFT; Document with ID 4341561273e3e48d80f5427212e785f0d7044835 already exists
4341561273e3e48d80f5427212e785f0d7044835

Note

Digichem is extremely strict about what it considers an ‘identical’ calculation. In short, the calculation log files must match exactly. This means two calculations submitted at different times, even if otherwise identical, will not be considered the same (because the completion time is included in most log files).

`digichem database slice`

While insert is used to add data from computational log files, slice is used to insert data from another database.

$ digichem database source.db slice destination.db QUERIES

Where QUERIES are a number of search queries specifying what to copy. If none are given, then the entirety of source.db is copied to destination.db:

$ digichem database source.db slice destination.db

In this case, the -N option can be specified to speed up the copying process:

$ digichem database source.db slice destination.db -N

Please see the command reference for more information on the -N option.

The search QUERIES are given in the same way as for the digichem database search sub-command. Only those calculation results that match the search criteria will be copied:

// Only copy results with a HOMO value greater than -6.0 eV.
$ digichem database source.db slice destination.db --homo ">-6"

`digichem database search`

The search sub-command is used to search for (and extract) results from a database:

$ digichem database main search QUERIES

Where QUERIES are a number of search queries specifying what to search for. If none are given, the entire database is returned:

// Export everything from the database.
$ digichem database main search
_id: 4341561273e3e48d80f5427212e785f0d7044835
atoms:
    alignment_method: Minimal
    charge: 0
    exact_mass:
        units: g mol^-1
        value: 78.04694999999998
    formula: C6H6
    linearity_ratio: 0.07920370892421824
    molar_mass:
        units: g mol^-1
        value: 78.11184
... // This will go on for some time...

Queries can be specified in a number of ways. For common search criteria (orbital energies, excited states etc.), a command line argument can be used. For example, to search for all calculation results with a HOMO above (more positive than) -5.0 eV:

// Search for results with a HOMO above -5.0 eV.
$ digichem database main search --homo ">-5.0"

Or to search for all calculations on Benzene:

// Search for calculations matching the SMILES for benzene (exact match).
$ digichem database main search --structure "c1ccccc1"

Substructure searching is also possible, using the --substructure option (which also matches against a SMILES string):

// Search for calculations on molecules containing the benzene subunit.
$ digichem database main search --substructure "c1ccccc1"

Queries can be freely intermixed:

// Search for calculations on benzene with a HOMO above -5.0 eV.
$ digichem database main search --structure "c1ccccc1" --homo ">-5.0"

Please see the section on queries for more information.

digichem database search can write to all the same file formats as the digichem result command, and in fact takes the same format options (-y, -j, -c, -t, -d, -a, and -s).

By default, the program will export to the Digichem native format (-y). The table (-t) format is particularly useful for viewing database results when combined with the less command (to provide scrolling):

// View the entire database at a glance.
$ digichem database main search -t | less -S

digichem database search can also easily export data to a file:

// Export the database to csv.
$ digichem database main search -c -O db_dump.csv

`digichem database count`

The count sub-command functions identically to digichem database search, except count returns the number of rows that match, rather than the rows themselves:

// How many calculations have we stored?
$ digichem database main count
50000
// And how many of them were on Benzene?
$ digichem database main count --structure "c1ccccc1"
49999

`digichem database delete`

The delete sub-command removes results from a database.

To avoid accidentally deleting data, digichem database delete expects an explicit list of result IDs to remove:

// By itself, this command does nothing.
$ digichem database results.db delete

// We need to tell Digichem what to delete:
$ digichem database results.db delete 4341561273e3e48d80f5427212e785f0d7044835

For the same reason, the delete command does not accept the common query formats (like --homo or --structure). To delete matching results, it is safer to first search for the results, and then run a separate delete command:

// Check we're deleting the correct results.
$ digichem database results.db search --structure "c1ccccc1" -t -f _id metadata:calculations metadata:package
_id                                       metadata:calculations      metadata:package
----------------------------------------  -------------------------  ------------------
5c407cd31b64517aaa0ebbee4852c24ed69dc30b  Optimisation, Frequencies  Gaussian
5346017182680cf95a6ef4cd73c343dc86692d1b  Excited States             ORCA
3c92447bade367bab402f6834208ddad0a343aaf  Optimisation, Frequencies  ORCA
475b13e7927b7cdbc150c8377a62a72bb9d26a3a  Excited States             Turbomole
5aa12c5465f42ac4befccc5de9f6e4a054d498f7  Excited States             Turbomole
3ece68dade40f931bd4533b091239bc4cad872c8  Single Point               Turbomole
4341561273e3e48d80f5427212e785f0d7044835  Excited States             Gaussian

// Only delete the Gaussian calculations.
$ digichem database results.db delete 5c407cd31b64517aaa0ebbee4852c24ed69dc30b 4341561273e3e48d80f5427212e785f0d7044835
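
When the list of IDs is long, copying them by hand is error-prone. One option, sketched here, is to save the table output to a file, filter it with standard text tools, and review the resulting delete command before running it. The temporary file and the sample rows (copied from the output above) are stand-ins for a real search result, and the final command is only printed, not executed:

```shell
# Stand-in for the saved output of:
#   digichem database results.db search --structure "c1ccccc1" -t -f _id metadata:calculations metadata:package
matches=$(mktemp)
cat > "$matches" <<'EOF'
_id                                       metadata:calculations      metadata:package
----------------------------------------  -------------------------  ------------------
5c407cd31b64517aaa0ebbee4852c24ed69dc30b  Optimisation, Frequencies  Gaussian
5346017182680cf95a6ef4cd73c343dc86692d1b  Excited States             ORCA
3ece68dade40f931bd4533b091239bc4cad872c8  Single Point               Turbomole
4341561273e3e48d80f5427212e785f0d7044835  Excited States             Gaussian
EOF

# Skip the two header lines, keep rows whose last column is Gaussian,
# print the first column (the ID), and join the IDs with spaces.
ids=$(awk 'NR > 2 && $NF == "Gaussian" { print $1 }' "$matches" | paste -sd ' ' -)

# Print the command for review instead of running it.
echo "digichem database results.db delete $ids"
```

Printing the command first keeps the explicit-ID safeguard intact: nothing is removed until the reviewed line is run by hand.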

However, if you are certain, you can delete matching results directly using the --search option:

// Dangerous! This will immediately delete all 'matching' results!
$ digichem database results.db delete --search "atoms:smiles==c1ccccc1"

Database Format

Digichem supports two database formats: TinyDB and Mongita. TinyDB (the default) is text-based and therefore human-readable, but may perform poorly for very large datasets. By contrast, Mongita uses a binary format that is likely to perform better, but can only be readily read and written by Digichem itself (or other programs using Mongita).

To create a database in TinyDB format, use the -t option before the database file:

$ digichem database -t results.db insert FILES

Alternatively, specify the -m option to use the Mongita format:

$ digichem database -m results.db insert FILES

Once a database has been created, the -t or -m options can be omitted (Digichem will work out the correct format automatically):

$ digichem database -m results.db insert Benzene.log Pyridine.log
$ digichem database results.db count
2

Note

The -t and -m options have no effect when using a builtin database (because the format of these databases is set in the config). Trying to use either option in this scenario will result in a warning from Digichem:

$ digichem database -t main count
digichem: WARNING: UserWarning: 'main' is a builtin database and its type cannot be changed dynamically (it is set in the 'databases' config option)

Database Queries

Database queries are used to specify which calculation results in the database to act upon. There are two types of queries currently supported by Digichem.

Simple queries

Simple queries are specified using a command-line argument (like --homo or --structure). The following simple queries are currently supported:

--homo ENERGY --lumo ENERGY --beta-homo ENERGY --beta-lumo ENERGY

Search for HOMO/LUMO values that match a given energy (in eV). For unrestricted calculations, --homo and --lumo search against the alpha orbitals, while --beta-homo and --beta-lumo search against the beta orbitals.

The energy should be prefixed with either < (for less than) or > (for more than). For example:

// HOMO energy less than -5.0 eV
$ digichem database main search --homo "<-5"
// LUMO energy above 1.5 eV
$ digichem database main search --lumo ">1.5"
// HOMO energy between -4 and -6 eV.
$ digichem database main search --homo "<-4.0" --homo ">-6.0"

--singlet-energy ENERGY --triplet-energy ENERGY --dest ENERGY

Search for the energy of the first singlet excited state, first triplet excited state, or the difference between them respectively, in eV. The energy should be prefixed with either < (for less than) or > (for more than).

--singlet-wavelength NM --triplet-wavelength NM

Search for the wavelength of the first singlet excited state or the first triplet excited state respectively, in nm. The wavelength should be prefixed with either < (for less than) or > (for more than).

--structure SMILES --substructure SMILES

Search for molecules that match the given SMILES string, either exactly (--structure) or that contain the given structure (--substructure).

Important

You should always wrap queries containing the greater-than or less-than signs (‘<’ or ‘>’) in quotation marks on the command line. Otherwise, the shell will interpret them as I/O redirection operators, which will result in errors such as this:

// This won't work
$ digichem database main search --lumo >5
Digichem_exception: The query string 'lumo' is missing a comparison operator
// This is correct
$ digichem database main search --lumo ">5"
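
The effect of the quotation marks can be checked without touching a database. In this sketch, count_args is a hypothetical stand-in for digichem that simply records how many arguments it received, and /dev/null is used as the redirection target so that no stray file is created (the unquoted command above would instead write to a file named 5):

```shell
# Hypothetical stand-in for digichem: remember how many arguments arrived.
count_args() { nargs=$#; }

# Quoted: ">5" reaches the program as the value of --lumo.
count_args --lumo ">5"
quoted=$nargs

# Unquoted: the shell consumes > and the word after it as a redirection,
# so the program only ever sees --lumo.
count_args --lumo >/dev/null
unquoted=$nargs

echo "quoted: $quoted argument(s), unquoted: $unquoted argument(s)"
```

The unquoted call passes a single argument, which is why the query ends up without a comparison operator.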

Complex Queries

Important

The database query language is currently under development, and is likely to change in the future.

Each query consists of three main parts: 1) a list of fields, 2) a comparison operator, and 3) a match.

For example, in atoms:smiles==c1ccccc1, atoms:smiles is the field list, == is the comparison operator, and c1ccccc1 is the match.

The field list is delimited by colons and identifies which value in each result to compare. To search in a field that contains a list, the ‘any’ pseudo-field can be specified. For example, to compare orbital energies: orbitals:values:any:energy:value.

The comparison operator is one of the following:

Symbol	Meaning
`==`	Exact match
`=`	Case insensitive match
`>`	Greater than
`<`	Less than
`>=`	Greater than or equal to
`<=`	Less than or equal to
`<>`	Substructure matching (only for SMILES)

Each operator can optionally be prefixed with an exclamation mark (!) to negate it (i.e., !== would select all entries that do not match exactly).

The match is the value to compare against. Values that look like a specific type (integer, float, boolean, etc.) are converted to that type.
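
To make the three parts concrete, the following sketch splits a query string into field list, optional negation, operator, and match using sed. It is only an illustration of the syntax described above, not Digichem's own parser; the helper name split_query is made up for this example, and the second and third queries are assembled from the pieces documented here rather than taken from real usage:

```shell
# Illustration only: split FIELDS[!]OPERATOR MATCH into its parts.
# Prints: fields|negation|operator|match
split_query() {
    printf '%s\n' "$1" | sed -E \
        's/^([A-Za-z0-9_:]+)(!?)(==|>=|<=|<>|=|>|<)([^=<>].*)$/\1|\2|\3|\4/'
}

split_query 'atoms:smiles==c1ccccc1'                  # atoms:smiles||==|c1ccccc1
split_query 'orbitals:values:any:energy:value>=-5.0'  # orbitals:values:any:energy:value||>=|-5.0
split_query 'atoms:smiles!<>c1ccccc1'                 # atoms:smiles|!|<>|c1ccccc1
```

The match is required to start with a character other than =, < or >, which forces two-character operators such as == and <> to be recognised whole.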
