Database Client Options
The database client provides many options for configuring how it processes documents and saves them to a database; this document describes all of them.
All these descriptions can also be found using emtellipro-db-client --help and emtellipro-db-client COMMAND --help (where COMMAND is one of the commands supported by the client).
The database client’s options are grouped into two parts: the global options and the command options. The global options apply to the client overall and are shared by all commands, while the command options are specific to a given command.
When running the database client, you’ll format the command as follows:
$ emtellipro-db-client [GLOBAL-OPTIONS] COMMAND [OPTIONS]
Here, COMMAND may be something like create-db or process. The [GLOBAL-OPTIONS] will contain options for configuring the client (such as which database to use and how logging will be handled), and [OPTIONS] will be options for the command being run (e.g. the process command has options for connecting to emtelliPro, but these would be irrelevant for create-db).
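As a rough illustration of this ordering, the Python sketch below assembles an argument list with the global options before the command and the command's options after it. The helper name build_command is hypothetical, and --max-retries is assumed to be the command-line spelling of the max-retries option; the sketch only builds the list and does not run the client.

```python
def build_command(global_options, command, command_options):
    """Return an argv list: client name, global options, command, command options."""
    argv = ["emtellipro-db-client"]
    argv.extend(global_options)    # options for the client as a whole
    argv.append(command)           # e.g. create-db or process
    argv.extend(command_options)   # options specific to that command
    return argv


argv = build_command(
    ["--log-level", "debug"],      # global option, placed before the command
    "process",
    ["--max-retries", "10"],       # assumed spelling, derived from the config-file name
)
print(" ".join(argv))
```

Keeping the three parts as separate lists makes it harder to place a global option after the command by accident.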
In addition to the regular command-line options, the database client also allows passing in a configuration file using the emtellipro-db-client --config global option. This configuration file lets you store commonly used options in one place so they can be reused across multiple processing jobs.
Configuration file
The configuration file is in the TOML format [1]. Our limited use of TOML will look a lot like INI files, although TOML is better defined.
A simple configuration file will look something like the following:
[emtellipro-db-client]
# GLOBAL-OPTIONS will go here
database = "URL of your database"
[process]
# command-specific OPTIONS will go here
access-key = "your access key"
shared-secret = "your secret key"
server = "emtelliPro server URL"
recursive = true
bulk-insert = true
store-reports = false
max-retries = 10
The global options all go in the [emtellipro-db-client] section, and command-specific options go into a section named the same as the command. The [emtellipro-db-client] section is required, even if empty, but no other sections are required. Any options not set in the configuration file will use their defaults unless they are set on the command line.
Defaults and overriding options
Most options have defaults, which are used unless overridden by the configuration file or command-line options. The priority of option values, from lowest to highest, is as follows:
1. The defaults for each option are used if not otherwise set.
2. The options in the configuration file override the defaults (for the options specified in the configuration file).
3. The options passed on the command line override the options set in the configuration file and the defaults in (1).
As a simple example, the default for log-level is info. If the configuration file specifies log-level = "error", then only error messages will be logged. If you then also specify --log-level "debug", then the logging level will be set to “debug”, and all debugging messages will be printed.
This hierarchy allows you to (1) set only the options you care about, (2) place commonly used options in a configuration file, and (3) set options specific to a given job on the command line without having to edit the configuration file for each job.
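The priority order can be modelled in a few lines of Python. This is a minimal sketch of the lookup rule using collections.ChainMap, not the client's actual implementation; the three dictionaries stand in for the defaults, the parsed configuration file, and the parsed command line.

```python
from collections import ChainMap

defaults = {"log-level": "info", "log-file": "errors.log"}
config_file = {"log-level": "error"}
command_line = {"log-level": "debug"}

# ChainMap searches its mappings left to right, so the
# highest-priority source is listed first.
resolved = ChainMap(command_line, config_file, defaults)
print(resolved["log-level"])     # debug
print(resolved["log-file"])      # errors.log

# Without a command-line value, the configuration file wins over the default.
without_cli = ChainMap({}, config_file, defaults)
print(without_cli["log-level"])  # error
```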
Option descriptions
All options are described in the following tables, with these columns:
- Command-line
This contains the command-line option names.
- Config file
This contains the config-file option name. This will be exactly the same as the command-line option, without the leading hyphens.
- Type
This is the type of argument this option takes. Some options will take a string as argument, some will take a number, and some are boolean options.
Boolean options (also called “flags”) will usually have a complementing --no-* version, which is available simply to be more explicit when running the client; omitting the flag is exactly the same as specifying the --no-* option, so specifying --no-* acts as a reminder that the option is disabled.
Boolean options in the configuration file are specified using the true or false values.
- Default
This is the default value for the given option.
Required options will say required, and you must provide a value for these in all cases, either in the configuration file or using a command-line argument. If you omit a required option, the client will prompt you for a value.
Options that have defaults are optional, but some without defaults will also be marked optional because setting the option is not required.
- Description
This column contains the description of the option.
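The behaviour of boolean flags and their --no-* counterparts can be illustrated with Python's argparse module. This is only a sketch of the general pattern, not the client's own parser; it assumes the store-reports option is spelled --store-reports and --no-store-reports on the command line, and the program name flag-demo is made up.

```python
import argparse

parser = argparse.ArgumentParser(prog="flag-demo")
# BooleanOptionalAction (Python 3.9+) registers both
# --store-reports and --no-store-reports for one option.
parser.add_argument(
    "--store-reports",
    action=argparse.BooleanOptionalAction,
    default=False,
)

omitted = parser.parse_args([])
negated = parser.parse_args(["--no-store-reports"])
enabled = parser.parse_args(["--store-reports"])

# Omitting the flag and passing the --no-* form give the same result.
print(omitted.store_reports, negated.store_reports, enabled.store_reports)
```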
Descriptions of all the options can also be found using the --help option. This section may sometimes go into more detail than is found in the help message.
Global options
These options apply to the database client as a whole, and are not command-specific. When passed on the command line, they immediately follow emtellipro-db-client and precede the command name. In the configuration file, they go in the [emtellipro-db-client] section.
| Command-line | Config file | Type | Default | Description |
|---|---|---|---|---|
| | n/a | string: path to a file | optional | Specifies the path to the configuration file, if used. The configuration file is described above. |
| | database | string: a URL | required | The database to connect to, formatted as a URL containing all necessary connection settings. For most database systems, a SQLAlchemy-compatible URL format is necessary, but storing to files on disk is also supported using special prefixes, such as: |
| | snowflake-private-key-path | string: path to a file | optional | When connecting to a Snowflake database to store results, one can either use password-based authentication by setting the password in |
| | log-file | string: path to a file | errors.log | File path for where to store logging output. |
| | log-level | string: one of error, warning, info, or debug | info | The minimum level of log messages to send to the log file. |
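To show what a URL "containing all necessary connection settings" carries, the sketch below splits a SQLAlchemy-style URL into its parts using only the Python standard library. The URL is hypothetical: the host, credentials, and database name are placeholders, and PostgreSQL is used only as a familiar example, since the text above does not list which backends the client supports.

```python
from urllib.parse import urlsplit

# Hypothetical URL in the general form dialect://user:password@host:port/database.
url = "postgresql://report_user:s3cret@db.example.com:5432/reports"

parts = urlsplit(url)
database_name = parts.path.lstrip("/")

print(parts.scheme)    # postgresql
print(parts.username)  # report_user
print(parts.hostname)  # db.example.com
print(parts.port)      # 5432
print(database_name)   # reports
```

Because the URL may contain a password, storing it in the configuration file rather than typing it on the command line keeps it out of shell history.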
Processing options
These options are for the process command, and immediately follow the process command on the command line; in the configuration file they go in the [process] section.
| Command-line | Config file | Type | Default | Description |
|---|---|---|---|---|
| | access-key | string | required | The access key used when authenticating with emtelliPro. |
| | shared-secret | string | required | The shared secret key used when authenticating with emtelliPro. |
| | server | string: a URL | required | The URL of the emtelliPro server you’d like to send the reports to. |
| | category | string | Radiology | The document category specified when processing reports. If the input documents come from JSON and SQL, the per-document category set on the input documents will override this option. |
| | subcategory | string | generic | The document subcategory specified when processing reports. If the input documents come from JSON and SQL, the per-document subcategory set on the input documents will override this option. |
| | document-type | string | plain | The document type specified when processing reports. See emtelliPro API documentation for more details about what values this option can take. |
| | section-label | string | optional | The section label used for all documents, if you’d like to use emtelliPro’s |
| | features | string | optional | The emtelliPro features to request when processing the documents. This is a comma-separated list of feature names with no spaces. In the configuration file you may use either a single comma-separated string, or a list of strings (as per TOML format). See emtelliPro API documentation for what features are available. |
| | file-encoding | string | optional | The file encoding to use when reading input files. If not set, it will be autodetected. If you know the file encoding ahead of time, it’s a good idea to set it here to avoid the overhead of autodetection. |
| | store-reports | boolean | false | If set, the client will store the document text in the |
| | store-json | boolean | false | If set, the client will store the JSON result from emtelliPro in the |
| | store-sections-and-sentences | boolean | false | If set, the client will store the text of sentences and sections in the |
| | recursive | boolean | false | Look for files to process recursively. This must be set if the input paths are directories; otherwise the client will assume all input paths represent individual files. |
| | job-id | string | optional | The job ID to be inserted into the |
| | sql-query | string | optional | If the input path is a database URL, this option must be set to tell the client what query to use to retrieve documents from the input database. The query must return at least |
| | bulk-insert | boolean | false | If set, the client will use batching for faster INSERTs where available. The speed difference is significant, so this should usually be enabled. |
| | text-is-pdf | boolean | false | If set, the |
| | quiet | boolean / integer | false / 0 | Make the client quieter. Passing this once disables the progress bar, but sends error messages to stderr; passing it twice hides the error messages (they’ll still be present in the log file). In the config file, you can set a number to indicate how quiet you wish the client to be. |
| | max-submit-shard-size | integer | 100 | Maximum number of documents to submit at once. If set to -1, the number of documents submitted in each shard is automatically determined based on document size (to submit the maximum number of documents that fit within emtelliPro’s submission size limit). This option is useful for restricting the shard size further than what is technically allowed by the API; this can be helpful for lowering memory usage. |
| | max-save-shard-size | integer | 50 | Maximum number of documents to save to the database at once, when using |
| | skip-database-checks | boolean | false | Skip checks for whether the database schema is valid. By default the client will check on startup whether all the tables and columns in the target database have the correct schema; this avoids potential errors when trying to save to a database that needs to be migrated (or has not been created), but there’s overhead to doing this. If you’ve already confirmed the database schema is correct, you should disable the checks. |
| | max-retries | integer | 5 | Maximum number of times to retry failed API requests. Failures may be due to network issues, so this should be a number greater than 0. |
| | store-failed | boolean | false | Store details of failed documents in the |
| | doc-id-filepath | boolean | false | Use the absolute file path as the document ID when submitting .txt, .json, and .jsonl documents. By default random UUIDs are generated to anonymize input document locations when submitting documents to emtelliPro, but this option can be useful for debugging. |
| | poll-freq | integer | 1 | Number of seconds to wait between API calls to check processing status. Polling too frequently can put extra load on emtelliPro, so for non-interactive uses of the client, it’s helpful to set this value higher. EmtelliPro will respond as soon as processing is completed, so setting this value higher than 1 second will not lead to slower processing, but it will cause the progress bar to be updated less often. |
| | filetype | string | optional | Assume all input files have the given file type, disabling automatic detection of file type based on file extension. See the |
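Two of the options above lend themselves to a short illustration: features accepts either a comma-separated string or a list of strings in the configuration file, and max-submit-shard-size caps how many documents are submitted at once. The Python sketch below models both under stated assumptions: the function names are hypothetical, the feature names feature-a and feature-b are placeholders rather than real emtelliPro features, and the automatic sizing used when the shard size is -1 is not modelled.

```python
def normalize_features(value):
    """Accept a comma-separated string or a list of strings; return a list."""
    if value is None:
        return []
    if isinstance(value, str):
        return [name for name in value.split(",") if name]
    return list(value)


def shards(documents, max_shard_size):
    """Yield consecutive groups of at most max_shard_size documents."""
    if max_shard_size < 1:
        raise ValueError("this sketch only handles a fixed, positive shard size")
    for start in range(0, len(documents), max_shard_size):
        yield documents[start:start + max_shard_size]


# Both configuration-file spellings produce the same list.
print(normalize_features("feature-a,feature-b"))
print(normalize_features(["feature-a", "feature-b"]))

# 250 documents with the default shard size of 100.
batches = list(shards([f"doc-{i}" for i in range(250)], 100))
print([len(batch) for batch in batches])
```

A smaller shard size means more requests but fewer documents held in memory at once, which matches the memory-usage note in the option's description.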
Other commands
Other commands do not currently take any command-specific options.
[1] The TOML format is a full-featured configuration language, but for our purposes we’re only using simple key-value options which will look fairly similar to INI files. This document shows an example using the [table] style, but dotted keys are also supported. Full documentation of TOML can be found here: https://toml.io/en/