Database Client Options

The database client provides many options for configuring how it processes documents and saves them to a database; this document describes all of them.

All these descriptions can also be found using emtellipro-db-client --help and emtellipro-db-client COMMAND --help (where COMMAND is one of the commands supported by the client).

The database client’s options are grouped into two parts: the global options and the command options. The global options apply to the client overall and are shared by all commands, while the command options are specific to a given command.

When running the database client, you’ll format the command as follows:

$ emtellipro-db-client [GLOBAL-OPTIONS] COMMAND [OPTIONS]

Here COMMAND may be something like create-db or process. [GLOBAL-OPTIONS] contains options that configure the client itself (such as which database to use and how logging is handled), and [OPTIONS] contains options for the command being run (e.g. the process command has options for connecting to emtelliPro, which would be irrelevant for create-db).
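
For scripted use, the ordering described above can be captured in a small helper. The following is a minimal Python sketch and not part of the client: build_argv is a hypothetical helper, config.toml and reports/ are placeholder paths, and placing the input path after the command's options is an assumption (check emtellipro-db-client process --help for the exact usage).

```python
import shlex


def build_argv(global_options, command, command_options=(), inputs=()):
    """Assemble an argument list: global options, then the command, then its options."""
    return ["emtellipro-db-client", *global_options, command, *command_options, *inputs]


argv = build_argv(
    global_options=["--config", "config.toml", "--log-level", "debug"],
    command="process",
    command_options=["--recursive", "--bulk-insert"],
    inputs=["reports/"],  # assumption: input paths follow the command options
)

# Print the command line in a copy-pasteable form.
print(shlex.join(argv))
```

The resulting list can be handed to subprocess.run(argv) once the client is installed; the sketch itself only builds and prints the command.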

In addition to the regular command-line options, the database client accepts a configuration file through the --config global option. The configuration file lets you store commonly used options in one place and reuse them across multiple processing jobs.

Configuration file

The configuration file is in the TOML format (see footnote 1). Our limited use of TOML will look a lot like an INI file, although TOML is better defined.

A simple configuration file will look something like the following:

[emtellipro-db-client]
# GLOBAL-OPTIONS will go here
database = "URL of your database"

[process]
# command-specific OPTIONS will go here
access-key = "your access key"
shared-secret = "your secret key"
server = "emtelliPro server URL"

recursive = true
bulk-insert = true
store-reports = false

max-retries = 10

The global options all go in the [emtellipro-db-client] section, and command-specific options go in a section with the same name as the command.

The [emtellipro-db-client] section is required, even if empty, but no other sections are required. Any options not set in the configuration file will simply use the defaults, or be overridden by passing command-line options.

Defaults and overriding options

Most options have defaults which will be used if not overridden by the configuration file or command-line options. Option values are resolved in the following order, from lowest to highest priority:

  1. The defaults for each option are used if not otherwise set.

  2. The options in the configuration file override the defaults (for the options specified in the configuration file).

  3. The options passed as command-line options override the options set in the configuration file and the defaults in (1).

As a simple example, the default for log-level is info. If the configuration file specifies log-level = "error", then only error messages will be logged. If you then also specify --log-level "debug", then the logging level will be set to “debug”, and all debugging messages will be printed.

This hierarchy allows you to (1) set only the options you care about, (2) place commonly used options in a configuration file, and (3) set job-specific options on the command line without having to edit the configuration file for each job.
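
The three-level priority described above can be modeled as a layered lookup. This is a minimal Python sketch of the idea, not the client's actual implementation; the dictionary contents mirror the log-level example, and the variable names are illustrative.

```python
from collections import ChainMap

# Built-in defaults (only two options are shown; values as documented for the global options).
defaults = {"log-level": "info", "log-file": "errors.log"}

# Values read from the configuration file (hypothetical contents).
config_file = {"log-level": "error"}

# Values passed on the command line for this run (hypothetical).
command_line = {"log-level": "debug"}

# ChainMap searches its mappings left to right, so earlier mappings win.
resolved = ChainMap(command_line, config_file, defaults)

print(resolved["log-level"])  # debug: the command line overrides the config file
print(resolved["log-file"])   # errors.log: falls through to the default
```

Dropping command_line from the chain leaves log-level at "error", which matches the case where only the configuration file sets the option.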

Option descriptions

All options are described in the following tables, with these columns:

Command-line

This contains the command-line option names.

Config file

This contains the config-file option name. It is exactly the same as the long command-line option name, without the leading hyphens.

Type

This is the type of argument this option takes. Some options will take a string as argument, some will take a number, and some are boolean options.

Boolean options (also called “flags”) usually have a complementary --no-* version, which exists simply to be more explicit when running the client. Omitting the flag is exactly the same as specifying the --no-* option, so specifying --no-* acts as a reminder that the option is disabled.

Boolean options in the configuration file are specified using the true or false values.

Default

This is the default value for the given option.

Required options will say required, and you must provide a value for these in all cases, either in the configuration file or using a command-line argument. If you omit a required option, the client will prompt you for a value.

Options that have defaults are optional, but some options without defaults are also marked optional because setting them is not required.

Description

This column contains the description of the option.

Descriptions of all the options can also be found using the --help option. This section may sometimes go into more detail than is found in the help message.
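
The relationship between a boolean flag and its --no-* form, described under Type above, can be illustrated with Python's standard argparse module. This is only a sketch: the client's own option parser is not described in this document, so the use of argparse and the example-client program name are assumptions made for illustration, and no configuration file is involved.

```python
import argparse

parser = argparse.ArgumentParser(prog="example-client")
# BooleanOptionalAction (Python 3.9+) registers both --store-reports and --no-store-reports.
parser.add_argument("--store-reports", action=argparse.BooleanOptionalAction, default=False)

omitted = parser.parse_args([]).store_reports
explicit_off = parser.parse_args(["--no-store-reports"]).store_reports
explicit_on = parser.parse_args(["--store-reports"]).store_reports

# Omitting the flag and passing --no-store-reports give the same result.
print(omitted, explicit_off, explicit_on)  # False False True
```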

Global options

These options apply to the database client as a whole, and are not command-specific. When passed on a command line, they immediately follow emtellipro-db-client, and precede the command name. In the configuration file, they go in the [emtellipro-db-client] section.

| Command-line | Config file | Type | Default | Description |
|---|---|---|---|---|
| -c, --config | n/a | string: path to a file | optional | Specifies the path to the configuration file, if used. The configuration file is described above. |
| --database | database | string: a URL | required | The database to connect to, formatted as a URL containing all necessary connection settings. For most database systems, a SQLAlchemy-compatible URL format is necessary, but storing to files on disk is also supported using special prefixes, such as jsonl://PATH, json://PATH, csv://PATH, and raw://PATH, which store results as jsonl, json, csv, and unprocessed result files (respectively) in the provided directory PATH. |
| --snowflake-private-key-path | snowflake-private-key-path | string: path to a file | optional | When connecting to a Snowflake database to store results, you can either use password-based authentication by setting the password in --database, or use key-pair authentication by setting this option. |
| --log-file | log-file | string: path to a file | errors.log | File path for where to store logging output. |
| --log-level | log-level | string: one of error, warning, info, or debug | info | The minimum level of log messages to send to the log file. |
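
The --database option accepts two kinds of URL: file-output prefixes and SQLAlchemy-compatible URLs. The sketch below shows one way a script might tell them apart before launching a job. database_target_kind is a hypothetical helper, not part of the client, and the prefix set contains only the four prefixes named above, which may not be exhaustive.

```python
# File-output prefixes named in the --database description (possibly not exhaustive).
FILE_PREFIXES = {"jsonl", "json", "csv", "raw"}


def database_target_kind(url: str) -> str:
    """Return "files" for the file-output prefixes, otherwise "sqlalchemy"."""
    scheme, separator, _rest = url.partition("://")
    if not separator:
        raise ValueError(f"not a URL: {url!r}")
    return "files" if scheme.lower() in FILE_PREFIXES else "sqlalchemy"


print(database_target_kind("jsonl://results"))       # files
print(database_target_kind("sqlite:///results.db"))  # sqlalchemy
```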

Processing options

These options are for the process command, and immediately follow the process command on the command-line; in the configuration file they go in the [process] section.

| Command-line | Config file | Type | Default | Description |
|---|---|---|---|---|
| --access-key | access-key | string | required | The access key used when authenticating with emtelliPro. |
| --shared-secret | shared-secret | string | required | The shared secret key used when authenticating with emtelliPro. |
| --server | server | string: a URL | required | The URL of the emtelliPro server you’d like to send the reports to. |
| --category | category | string | Radiology | The document category specified when processing reports. If the input documents come from JSON or SQL, the per-document category set on the input documents will override this option. |
| --subcategory | subcategory | string | generic | The document subcategory specified when processing reports. If the input documents come from JSON or SQL, the per-document subcategory set on the input documents will override this option. |
| --document-type | document-type | string | plain | The document type specified when processing reports. See the emtelliPro API documentation for more details about what values this option can take. |
| --section-label | section-label | string | optional | The section label used for all documents, if you’d like to use emtelliPro’s process_with_section_labels feature. Setting this option will tell emtelliPro to assume the entire input document is part of a single section (with this section label). |
| --features | features | string | optional | The emtelliPro features to request when processing the documents. This is a comma-separated list of feature names with no spaces. In the configuration file you may use either a single comma-separated string, or a list of strings (as per TOML format). See the emtelliPro API documentation for what features are available. |
| --file-encoding | file-encoding | string | optional | The file encoding to use when reading input files. If not set, it will be autodetected. If you know the file encoding ahead of time, it’s a good idea to set it here to avoid the overhead of autodetection. |
| --store-reports / --no-store-reports | store-reports | boolean | false | If set, the client will store the document text in the document.text database column. If the input document is a PDF, the document text will be stored regardless of this option. |
| --store-json / --no-store-json | store-json | boolean | false | If set, the client will store the JSON result from emtelliPro in the document.json_representation database column. |
| --store-sections-and-sentences / --no-sections-and-sentences | store-sections-and-sentences | boolean | false | If set, the client will store the text of sentences and sections in the sentencelocation and sectionlocation tables, respectively. This allows the user to avoid having to compute the text based on the document text and sentence/section spans. |
| -r, --recursive | recursive | boolean | false | Look for files to process recursively. This must be set if the input paths are directories; otherwise the client will assume all input paths represent individual files. |
| --job-id | job-id | string | optional | The job ID to be inserted into the processingdetails table. This can be used to group documents from multiple runs of the client into a single “job”. If not set, a random UUID will be generated. |
| --sql-query | sql-query | string | optional | If the input path is a database URL, this option must be set to tell the client what query to use to retrieve documents from the input database. The query must return at least id and text columns; any extra fields are stored in the documentmetadata table. If category and subcategory are included, they will be stored as original_category and original_subcategory. |
| --bulk-insert / --no-bulk-insert | bulk-insert | boolean | false | If set, the client will use batching for faster INSERTs where available. The speed difference is significant, so this should usually be enabled. |
| --text-is-pdf | text-is-pdf | boolean | false | If set, the text column in the input SQL query will be treated as raw PDF bytes, rather than plaintext. This implies --store-reports. |
| --quiet | quiet | boolean / integer | false / 0 | Make the client quieter. Passing this once disables the progress bar, but sends error messages to stderr; passing it twice hides the error messages (they’ll still be present in the log file). In the config file, you can set a number to indicate how quiet you wish the client to be. |
| --max-submit-shard-size | max-submit-shard-size | integer | 100 | Maximum number of documents to submit at once. If set to -1, the number of documents submitted in each shard is automatically determined based on document size (to submit the maximum number of documents that fit within emtelliPro’s submission size limit). This option is useful for restricting the shard size further than what is technically allowed by the API; this can be helpful for lowering memory usage. |
| --max-save-shard-size | max-save-shard-size | integer | 50 | Maximum number of documents to save to the database at once, when using --bulk-insert. Large numbers can cause database errors, so this option is best kept under 50. |
| --skip-database-checks / --no-skip-database-checks | skip-database-checks | boolean | false | Skip checks for whether the database schema is valid. By default the client will check on startup whether all the tables and columns in the target database have the correct schema; this avoids potential errors when trying to save to a database that needs to be migrated (or has not been created), but there’s overhead to doing this. If you’ve already confirmed the database schema is correct, you should disable the checks. |
| --max-retries | max-retries | integer | 5 | Maximum number of times to retry failed API requests. Failures may be due to network issues, so this should be a number greater than 0. |
| --store-failed / --no-store-failed | store-failed | boolean | false | Store details of failed documents in the documents table. You can find these by querying for documents.processing_status = 'error'. |
| --doc-id-filepath | doc-id-filepath | boolean | false | Use the absolute file path as the document ID when submitting .txt, .json, and .jsonl documents. By default random UUIDs are generated to anonymize input document locations when submitting documents to emtelliPro, but this option can be useful for debugging. |
| --poll-freq | poll-freq | integer | 1 | Number of seconds to wait between API calls to check processing status. Polling too frequently can put extra load on emtelliPro, so for non-interactive uses of the client, it’s helpful to set this value higher. EmtelliPro will respond as soon as processing is completed, so setting this value higher than 1 second will not lead to slower processing, but it will cause the progress bar to be updated less often. |
| --filetype | filetype | string | optional | Assume all input files have the given file type, disabling automatic detection of file type based on file extension. See the --help option for the available file types (since they depend on installed plugins). |
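
The --features option takes a comma-separated string on the command line, while the configuration file accepts either that string or a list of strings. The sketch below shows how the two forms relate. It is an illustration only, not the client's code: normalize_features and to_command_line are hypothetical helpers, and feature-a and feature-b are placeholders rather than real emtelliPro feature names.

```python
def normalize_features(value):
    """Accept a comma-separated string or a list of strings; return a list of names."""
    if isinstance(value, str):
        return [name for name in value.split(",") if name]
    return list(value)


def to_command_line(features):
    """Render feature names in the comma-separated, space-free form used with --features."""
    return ",".join(features)


# "feature-a" and "feature-b" are placeholders, not real emtelliPro feature names.
from_string = normalize_features("feature-a,feature-b")
from_list = normalize_features(["feature-a", "feature-b"])

print(from_string == from_list)    # True
print(to_command_line(from_list))  # feature-a,feature-b
```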

Other commands

Other commands do not currently take any command-specific options.

1. The TOML format is a full-featured configuration language, but for our purposes we’re only using simple key-value options, which look fairly similar to INI files. This document shows an example using the [table] style, but dotted keys are also supported. Full documentation of TOML can be found at https://toml.io/en/