Digichem Database
The digichem database program is used to manage databases of completed calculation results.
It consists of several sub-commands, depending on what you want to do to the database:
$ digichem database FILE SUB_COMMAND
FILE is the database file you want to act upon. It can be a path to a database file:
$ digichem database results.db SUB_COMMAND
Or the name of a database pre-configured into Digichem. You can configure as many databases as you like, but by default the only configured database is the main database:
$ digichem database main SUB_COMMAND
digichem database insert
The insert sub-command is used to add completed calculation results to a database:
$ digichem database results.db insert FILES
Where FILES are completed calculation log files:
$ digichem database results.db insert Benzene.log Pyridine.log
The shell wildcard character * can be used as normal to insert many calculation results at once:
// Insert all calculation log files in the current folder.
$ digichem database results.db insert *.log
For example, to insert all the results from a computational screen submitted with Digichem with the following structure:
Benzene/
    Gaussian 16 Optimisation Frequencies PBE1PBE (GD3BJ) Toluene 6-31G(d,p)/
    Gaussian 16 Excited States TDA 10 Singlets 10 Triplets PBE1PBE (GD3BJ) Toluene 6-31G(d,p)/
Pyridine/
    Gaussian 16 Optimisation Frequencies PBE1PBE (GD3BJ) Toluene 6-31G(d,p)/
    Gaussian 16 Excited States TDA 10 Singlets 10 Triplets PBE1PBE (GD3BJ) Toluene 6-31G(d,p)/
Toluene/
    Gaussian 16 Optimisation Frequencies PBE1PBE (GD3BJ) Toluene 6-31G(d,p)/
    Gaussian 16 Excited States TDA 10 Singlets 10 Triplets PBE1PBE (GD3BJ) Toluene 6-31G(d,p)/
Use the following command:
$ digichem database results.db insert */*
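The */* pattern is expanded by the shell before Digichem runs, so it can be worth previewing what a two-level pattern picks up. The Python sketch below is only an illustration: it builds a cut-down copy of the layout above in a temporary folder (the shortened folder names are placeholders, not real Digichem output) and lists what the pattern matches. Python's globbing approximates the shell's behaviour here but is not guaranteed to be identical in every case.

```python
import tempfile
from pathlib import Path

# Hypothetical, shortened names standing in for the real calculation folders.
MOLECULES = ["Benzene", "Pyridine", "Toluene"]
CALCULATIONS = ["Optimisation Frequencies", "Excited States"]


def preview_matches(root: Path, pattern: str = "*/*") -> list:
    """Return the paths (relative to root) that the glob pattern matches."""
    return sorted(p.relative_to(root).as_posix() for p in root.glob(pattern))


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Recreate the molecule/calculation folder layout.
    for molecule in MOLECULES:
        for calculation in CALCULATIONS:
            (root / molecule / calculation).mkdir(parents=True)
    matches = preview_matches(root)

# One entry per calculation folder, i.e. what would be handed to 'insert'.
for match in matches:
    print(match)
```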
Digichem will generate a unique ID for each calculation that is inserted into the database, and these will be printed at the end of the command:
$ digichem database results.db insert Benzene.log
4341561273e3e48d80f5427212e785f0d7044835
For a given calculation, this ID is always the same (even if inserting into a different database), and Digichem will not allow an identical calculation to be inserted into the same database multiple times. If you try to do this, Digichem will print a warning:
// Try to insert the same file twice
$ digichem database results.db insert Benzene.log Benzene.log
digichem: WARNING: UserWarning: Not inserting 4341561273e3e48d80f5427212e785f0d7044835, Benzene, Gaussian (2016+C.01), Optimisation, Frequencies, DFT; Document with ID 4341561273e3e48d80f5427212e785f0d7044835 already exists
4341561273e3e48d80f5427212e785f0d7044835
Note
Digichem is extremely strict on what it considers an ‘identical’ calculation. In short, the calculation log files must match exactly. This means two calculations submitted at different times, even if otherwise identical, will not be considered the same (because the completion time is included in most log files).
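To illustrate why a different completion time alone is enough to produce a different ID, here is a minimal Python sketch. It assumes, purely for illustration, that an ID behaves like a SHA-1 digest of the log file contents; this page does not state how Digichem actually derives its IDs, and the log text below is invented.

```python
import hashlib


def content_id(log_text: str) -> str:
    """Hypothetical ID: SHA-1 hex digest of the log contents (an assumption)."""
    return hashlib.sha1(log_text.encode("utf-8")).hexdigest()


# Two invented logs that differ only in the recorded completion time.
log_a = "Energy = -232.1\nNormal termination at 10:00:00\n"
log_b = "Energy = -232.1\nNormal termination at 10:00:05\n"

id_a = content_id(log_a)
id_b = content_id(log_b)

print(id_a == content_id(log_a))  # identical content gives the same ID
print(id_a == id_b)               # a changed timestamp gives a different ID
```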
digichem database slice
While insert is used to add data from calculation log files, slice is used to insert data from another database:
$ digichem database source.db slice destination.db QUERIES
Where QUERIES are a number of search queries specifying what to copy. If none are given, then the entirety of source.db is copied to destination.db:
$ digichem database source.db slice destination.db
In this case, the -N option can be specified to speed up the copying process:
$ digichem database source.db slice destination.db -N
Please see the command reference for more information on the -N option.
The search QUERIES are given in the same way as for the digichem database search sub-command.
Only those calculation results that match the search criteria will be copied:
// Only copy results with a HOMO value greater than -6.0 eV.
$ digichem database source.db slice destination.db --homo ">-6"
digichem database search
The search sub-command is used to search for (and extract) results from a database:
$ digichem database main search QUERIES
Where QUERIES are a number of search queries specifying what to search for. If none are given, the entire database is returned:
// Export everything from the database.
$ digichem database main search
_id: 4341561273e3e48d80f5427212e785f0d7044835
atoms:
  alignment_method: Minimal
  charge: 0
  exact_mass:
    units: g mol^-1
    value: 78.04694999999998
  formula: C6H6
  linearity_ratio: 0.07920370892421824
  molar_mass:
    units: g mol^-1
    value: 78.11184
... // This will go on for some time...
Queries can be specified in a number of ways. For common search criteria (orbital energies, excited states etc.), a command line argument can be used. For example, to search for all calculation results with a HOMO above (more positive than) -5.0 eV:
// Search for results with a HOMO above -5.0 eV.
$ digichem database main search --homo ">-5.0"
Or to search for all calculations on benzene:
// Search for calculations matching the SMILES for benzene (exact match).
$ digichem database main search --structure "c1ccccc1"
Substructure searching is also possible, using the --substructure option (which also matches against a SMILES string):
// Search for calculations on molecules containing the benzene subunit.
$ digichem database main search --substructure "c1ccccc1"
Queries can be freely intermixed:
// Search for calculations on benzene with a HOMO above -5.0 eV.
$ digichem database main search --structure "c1ccccc1" --homo ">-5.0"
Please see the section on queries for more information.
digichem database search can write to all the same file formats as the digichem result command, and in fact takes the same format options (-y, -j, -c, -t, -d, -a, and -s).
By default, the program will export to the Digichem native format (-y).
The table (-t) format is particularly useful for viewing database results when combined with the less command (to provide scrolling):
// View the entire database at a glance.
$ digichem database main search -t | less -S
digichem database search can also easily export data to a file:
// Export the database to csv.
$ digichem database main search -c -O db_dump.csv
digichem database count
The count sub-command functions identically to digichem database search, except count returns the number of rows that match, rather than the rows themselves:
// How many calculations have we stored?
$ digichem database main count
50000
// And how many of them were on Benzene?
$ digichem database main count --structure "c1ccccc1"
49999
digichem database delete
The delete sub-command removes results from a database.
To avoid accidentally deleting data, digichem database delete expects an explicit list of result IDs to remove:
// By itself, this command does nothing.
$ digichem database results.db delete
// We need to tell Digichem what to delete:
$ digichem database results.db delete 4341561273e3e48d80f5427212e785f0d7044835
For the same reason, the delete command does not accept the common query formats (like --homo or --structure).
To delete matching results, it is safer to first search for the results, and then run a separate delete command:
// Check we're deleting the correct results.
$ digichem database results.db search --structure "c1ccccc1" -t -f _id metadata:calculations metadata:package
_id                                      metadata:calculations     metadata:package
---------------------------------------- ------------------------- ------------------
5c407cd31b64517aaa0ebbee4852c24ed69dc30b Optimisation, Frequencies Gaussian
5346017182680cf95a6ef4cd73c343dc86692d1b Excited States            ORCA
3c92447bade367bab402f6834208ddad0a343aaf Optimisation, Frequencies ORCA
475b13e7927b7cdbc150c8377a62a72bb9d26a3a Excited States            Turbomole
5aa12c5465f42ac4befccc5de9f6e4a054d498f7 Excited States            Turbomole
3ece68dade40f931bd4533b091239bc4cad872c8 Single Point              Turbomole
4341561273e3e48d80f5427212e785f0d7044835 Excited States            Gaussian
// Only delete the Gaussian calculations.
$ digichem database results.db delete 5c407cd31b64517aaa0ebbee4852c24ed69dc30b 4341561273e3e48d80f5427212e785f0d7044835
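The check-then-delete workflow can also be scripted. The Python sketch below is an illustration only and is not part of Digichem: it takes table output in the layout shown above (pasted in as a string and shortened to three rows), picks the IDs whose package column is Gaussian, and builds the matching delete command without running it. It assumes the ID is the first whitespace-separated column and that the package name is the last column and contains no spaces.

```python
import shlex

# Table text in the same layout as the search output above (shortened).
TABLE = """\
_id                                      metadata:calculations     metadata:package
---------------------------------------- ------------------------- ------------------
5c407cd31b64517aaa0ebbee4852c24ed69dc30b Optimisation, Frequencies Gaussian
5346017182680cf95a6ef4cd73c343dc86692d1b Excited States            ORCA
4341561273e3e48d80f5427212e785f0d7044835 Excited States            Gaussian
"""


def ids_for_package(table: str, package: str) -> list:
    """Return the first column of each data row whose last column equals package."""
    ids = []
    for line in table.splitlines()[2:]:  # skip the header and the dashed rule
        columns = line.split()
        if columns and columns[-1] == package:
            ids.append(columns[0])
    return ids


gaussian_ids = ids_for_package(TABLE, "Gaussian")

# Build (but do not execute) the explicit delete command for review.
command = ["digichem", "database", "results.db", "delete", *gaussian_ids]
print(shlex.join(command))
```

Printing the command rather than executing it keeps the manual review step that makes this approach safer than deleting by query.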
However, if you are certain, you can delete matching results directly using the --search option:
// Dangerous! This will immediately delete all 'matching' results!
$ digichem database results.db delete --search "atoms:smiles==c1ccccc1"
Database Format
Digichem supports two database formats, TinyDB and Mongita. TinyDB (the default) is text-based, meaning it is human-readable, but may have reduced performance for very large datasets. By contrast, Mongita uses a binary format that is likely to be more performant, but can only be readily read/written to by Digichem itself (or other programs using Mongita).
To create a database in TinyDB format, use the -t option before the database file:
$ digichem database -t results.db insert FILES
Alternatively, specify the -m option to use the Mongita format:
$ digichem database -m results.db insert FILES
Once a database has been created, the -t or -m options can be omitted (Digichem will work out the correct format automatically):
$ digichem database -m results.db insert Benzene.log Pyridine.log
$ digichem database results.db count
2
Note
The -t and -m options have no effect when using a builtin database (because the format of these databases is set in the config).
Trying to use either option in this scenario will result in a warning from Digichem:
$ digichem database -t main count
digichem: WARNING: UserWarning: 'main' is a builtin database and its type cannot be changed dynamically (it is set in the 'databases' config option)
Database Queries
Database queries are used to specify which calculation results in the database to act upon. There are two types of queries currently supported by Digichem.
Simple queries
Simple queries are specified using a command-line argument (like --homo or --structure).
The following simple queries are currently supported:
--homo ENERGY
--lumo ENERGY
--beta-homo ENERGY
--beta-lumo ENERGY
Search for HOMO/LUMO values that match a given energy (in eV). For unrestricted calculations, --homo and --lumo search against the alpha orbitals, while --beta-homo and --beta-lumo search against the beta orbitals. The energy should be prefixed with either < (for less than) or > (for more than). For example:
// HOMO energy less than -5.0 eV
$ digichem database main search --homo "<-5"
// LUMO energy above 1.5 eV
$ digichem database main search --lumo ">1.5"
// HOMO energy between -4 and -6 eV.
$ digichem database main search --homo "<-4.0" --homo ">-6.0"
--singlet-energy ENERGY
--triplet-energy ENERGY
--dest ENERGY
Search for the energy of the first singlet excited state, first triplet excited state, or the difference between them respectively, in eV. The energy should be prefixed with either < (for less than) or > (for more than).
--singlet-wavelength NM
--triplet-wavelength NM
Search for the wavelength of the first singlet excited state or the first triplet excited state respectively, in nm. The wavelength should be prefixed with either < (for less than) or > (for more than).
--structure SMILES
--substructure SMILES
Search for molecules that match the given SMILES string, either exactly (--structure) or that contain the given structure (--substructure).
Important
You should always place the more-than or less-than signs (‘<’ or ‘>’) in quotation marks on the command line. Otherwise, the shell will interpret them as I/O redirection characters, which will result in errors such as this:
// This won't work
$ digichem database main search --lumo >5
Digichem_exception: The query string 'lumo' is missing a comparison operator
// This is correct
$ digichem database main search --lumo ">5"
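The same quoting concern applies when a command line is assembled by a script and then passed through a shell. As a sketch (the Python wrapper here is illustrative and not part of Digichem), shlex.join from the Python standard library adds the required quotes automatically:

```python
import shlex

# Arguments exactly as Digichem should receive them, before any shell quoting.
args = ["digichem", "database", "main", "search", "--lumo", ">5"]

# shlex.join quotes any argument containing shell-special characters such as '>'.
command_line = shlex.join(args)
print(command_line)  # digichem database main search --lumo '>5'
```

Alternatively, passing the argument list directly to subprocess.run (without shell=True) avoids the shell altogether, so no quoting is needed.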
Complex Queries
Important
The database query language is currently under development, and is likely to change in the future.
Each query consists of three main parts: 1) a list of fields, 2) a comparison operator, and 3) a match.
For example, in atoms:smiles==c1ccccc1, atoms:smiles is the field list, == is the comparison operator, and c1ccccc1 is the match.
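The three-part structure can be sketched in Python as follows. This is a hypothetical parser written for illustration, using the operator symbols listed in this section; it is not Digichem's own implementation, and the names Query and split_query are invented here.

```python
import re
from typing import NamedTuple


class Query(NamedTuple):
    fields: list       # e.g. ["atoms", "smiles"]
    negated: bool      # True if the operator was prefixed with "!"
    operator: str      # e.g. "=="
    match_value: str   # e.g. "c1ccccc1"


# Longer operators come first so that "==" is not read as "=" followed by "=".
QUERY_PATTERN = re.compile(r"([\w:]+)(!?)(==|>=|<=|<>|=|>|<)(.*)")


def split_query(query: str) -> Query:
    """Split a query string into field list, operator and match (illustrative)."""
    found = QUERY_PATTERN.fullmatch(query)
    if found is None:
        raise ValueError(f"no comparison operator found in {query!r}")
    fields, bang, operator, match_value = found.groups()
    return Query(fields.split(":"), bang == "!", operator, match_value)


parsed = split_query("atoms:smiles==c1ccccc1")
print(parsed.fields, parsed.operator, parsed.match_value)
```

Because the field list may only contain word characters and colons, the first other character marks the start of the operator, so a match containing "=" (as some SMILES strings do) is left intact.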
The field list is delimited by colons, and matches some result in a result set. To search in a field that contains a list, the ‘any’ pseudo-field can be specified. For example, to compare orbital energies: orbitals:values:any:energy:value.
A comparison operator, one of the following:
Symbol Meaning
------ ---------------------------------------
==     Exact match
=      Case insensitive match
>      Greater than
<      Less than
>=     Greater than or equal to
<=     Less than or equal to
<>     Substructure matching (only for SMILES)
Each operator can optionally be prefixed with an exclamation mark (!) to negate it (i.e. !== would select all entries that do not match exactly).
A match, a value to check against. Values that look like specific types (ints, floats, bools etc.) will be converted appropriately.
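As a rough illustration of how a field list and a match might be applied to a stored result, the Python sketch below follows a colon-delimited field list through nested data and converts the match text to a suitable type. Everything here is an assumption made for the example: the functions resolve and convert are hypothetical, the result document is invented, and treating ‘any’ as "at least one list item satisfies the comparison" is one plausible reading rather than documented behaviour.

```python
def convert(text: str):
    """Convert text that looks like a bool, int or float; otherwise keep the string."""
    if text.lower() in ("true", "false"):
        return text.lower() == "true"
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text


def resolve(document, fields):
    """Return every value reached by following fields; 'any' fans out over a list."""
    if not fields:
        return [document]
    head, rest = fields[0], fields[1:]
    if head == "any":
        return [value for item in document for value in resolve(item, rest)]
    return resolve(document[head], rest)


# Invented result document, shaped to fit the example field list above.
result = {"orbitals": {"values": [
    {"energy": {"value": -6.2, "units": "eV"}},
    {"energy": {"value": -1.1, "units": "eV"}},
]}}

energies = resolve(result, "orbitals:values:any:energy:value".split(":"))
threshold = convert("-5.0")

# True if at least one orbital energy lies above the threshold.
print(any(energy > threshold for energy in energies))
```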