Database Client
Here we show how to use the provided emtellipro-db-client
database client to:
Build a set of database tables used to store the emtelliPro output
Submit a document for processing (read from disk, or retrieved from a database) and store the output in the tables built in step 1
The goal of the database client is to allow users to pull documents from a database (or, like the basic client, process files from disk) and then store emtelliPro's structured output in a database. Once the data is stored, you can focus on app development or on writing queries that answer the questions you have about your data.
For full documentation of the database client’s configuration options, see Database Client Options.
Installing the Database Client
The advanced database client uses the SQLAlchemy ORM to work with
databases, plus additional driver modules to connect to the RDBMS
you're using. There are tables to represent all the information available
in the AnnotatedDocument instance. Technically,
this client should support any database that SQLAlchemy can connect to.
This client is installed automatically when you install the emtellipro
wheel file. If you’ve already followed the Installation
instructions, you’re nearly there. Your next step is to install the
relevant database connector:
pip install pymysql # for MySQL
pip install psycopg2 # for PostgreSQL
Note
You may see an error about SQLite extensions not being built. If you’re
not using SQLite, this will not be an issue, but if you are, you’ll want to
install the libsqlite3-dev
package (example shown for Debian/Ubuntu
below):
sudo apt install libsqlite3-dev
If you’ll be using a MySQL database, there are some extra dependencies you’ll need (example shown for Debian/Ubuntu below):
sudo apt install libssl-dev
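As a quick sanity check (this is a suggested verification step, not part of the client itself), you can confirm from Python that the driver you installed is importable; only the driver matching your RDBMS needs to succeed:

import importlib

# Check each common database driver; missing ones are fine if you don't use that RDBMS.
for module in ("pymysql", "psycopg2", "sqlite3"):
    try:
        importlib.import_module(module)
        print(f"{module}: available")
    except ImportError:
        print(f"{module}: not installed")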
Preparing the database for storing emtelliPro output
As the Database Client stores the output from emtelliPro into a set of database tables, you first need to create the database in your RDBMS, then use the client itself to create the tables to store the data.
For example, to create a database called example_db in MySQL
(or PostgreSQL; the command is the same), log into the console
of your database server and run the following command:
CREATE DATABASE example_db;
Next, we’ll have to use the Database Client to create the database tables
inside of the example_db
database that we’re going to use to store the
emtelliPro output. In order to tell the Database Client which database
to use to store the data, we use a URL like one of the following:
sqlite:///example.db will use example.db in the current directory
postgresql://postgres:your_password@localhost:5432/example_db will connect to a PostgreSQL server running on the default TCP port (5432) on localhost as the postgres user with password your_password, using the database example_db
mysql+pymysql://username:your_password@host:port/example_db will connect to a MySQL server running on host host on TCP port port, using username username and password your_password, to the database/schema example_db
Work out what your URL should be for your database before going on to the next step. As a general reminder, if you're getting connectivity errors, remember that the RDBMS has to be listening on a port, any firewall needs to permit the traffic, and the database user may have to be enabled for remote connectivity.
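If you'd like to verify the URL before handing it to the client, a minimal sketch using SQLAlchemy (installed alongside the client) might look like the following; the URL shown is only an example and should be replaced with your own:

from sqlalchemy import create_engine, text

# Example URL only; substitute the URL for your own database.
url = "postgresql://postgres:your_password@localhost:5432/example_db"

engine = create_engine(url)
with engine.connect() as conn:
    # A portable no-op query; if this succeeds, the URL and credentials work.
    conn.execute(text("SELECT 1"))
print("Connection OK")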
So let’s create the tables inside of our example_db
database.
To use the Database Client to create the database tables for SQLite, run:
emtellipro-db-client --database sqlite:///example.db create-db
And to use it to create the database tables for MySQL, run the following (but modify it for your environment):
emtellipro-db-client --database mysql+pymysql://username:password@hostname:port/example_db create-db
And to use it to create the database tables for PostgreSQL, run the following (but modify it for your environment):
emtellipro-db-client --database postgresql://username:password@hostname:port/example_db create-db
Upgrading the database
If you have a previous version of the database, created using an older version
of the database client (going back to v4.3.0), you can upgrade to the latest
version using the migrate subcommand, as follows:
emtellipro-db-client --database postgresql://username:password@hostname:port/example_db migrate
Warning
Please ensure you’ve backed up the data before running this command.
The migrate command will copy data over if columns change, and all
changes will execute within a transaction (which should roll back if
anything goes wrong), but it's recommended to have a backup just in case.
Input Sources
Next we have to decide what our input source will be, i.e. where we’re going to get the reports from. The Database Client supports two ingestion modes:
Reading input documents as .txt, .pdf, or .json (or .jsonl) files from disk
Reading input documents from a database
These use different command-line options which are explained below.
Note
For a complete list of emtelliPro client commands, please review the command-line help. For available report submission options, run the following commands:
emtellipro-db-client process --help
emtellipro-client submit --help
Processing Text or PDF Files from Disk
The final line of the command block that we use to run the Database Client is the name of the file or directory that we want to process. Notice the following:
The --store-reports option is used to store the raw report in the database's document table. If this option is not used, the default behaviour does not store the source report; however, if the input type is PDF, the report text returned by emtelliPro is always stored.
The --job-id option is used to store a unique identifier for the job in the document table of the database. If --job-id isn't specified, a unique identifier will be randomly generated.
Here’s the syntax for the command:
emtellipro-db-client \
--database sqlite:///example.db \
process \
--access-key ACCESS_KEY \
--secret-key SECRET_KEY \
--server https://api.emtelligent.com:50001 \
--store-reports \
--store-sections-and-sentences \
--category DOC_CATEGORY \
--subcategory DOC_SUBCATEGORY \
--features FEATURE1,FEATURE2,FEATURE3.. \
--job-id test_run_1 \
./path/to/document*.txt
If your test is successful, you’ll see output that looks like this:
Connecting as USERNAME
Submitting 10000 documents for processing...
Waiting for processing results...
100%|█████████████████████████████| 10000/10000 [32:18<00:00, 5.16doc/s]
Processed 10000 reports in 32:18.39s for an average per report processing
time of .193839s per report
Processing Text or PDF Documents from a Database
In this use case, we are going to retrieve CT scan reports which are stored as
plain text, or PDFs (binary data) from a database, submit them to emtelliPro
for processing, and store the emtelliPro output into the tables that we had
previously created in the example_db
database.
So before we create a command for processing, we need to decide on two things:
The URL of the database for the source reports
The SQL query string that we’re going to use for retrieving the reports from that database
Ingesting Text from source database
For this example, we assume that the database containing the source documents
is a MySQL database named reports_db. This database runs on the same
database server and uses the same user permissions as the destination database,
example_db.
Hence, the URL for the source database from which we want to retrieve reports
(reports_db) is:
mysql+pymysql://username:password@localhost:3306/reports_db
And the SQL query we'll use to retrieve reports is:
SELECT report_id AS id, report AS text FROM reports WHERE category='Radiology' and subcategory='CT' ORDER BY report_id ASC LIMIT 1
Note
The two fields emtelliPro requires are id and text, hence the use of
SQL's SELECT ... AS ... to rename the source columns to the
emtelliPro-required column names id and text. Also note that the
WHERE clause contains single quotes, which necessitates wrapping this
SQL statement in double quotes on the command line.
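Before handing the query to the client, you may want to sanity-check it against the source database. A minimal sketch using SQLAlchemy 1.4 or later (connection details are just the example values from above):

from sqlalchemy import create_engine, text

source_url = "mysql+pymysql://username:password@localhost:3306/reports_db"
query = (
    "SELECT report_id AS id, report AS text FROM reports "
    "WHERE category='Radiology' AND subcategory='CT' "
    "ORDER BY report_id ASC LIMIT 1"
)

engine = create_engine(source_url)
with engine.connect() as conn:
    row = conn.execute(text(query)).mappings().first()

# The client requires both an "id" and a "text" column in the result.
if row is None:
    print("Query returned no rows")
else:
    print(row["id"], row["text"][:80])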
Putting our source URL and SQL statement together, here is the
final command block for the Database Client. Note the use of the
--sql-query option and the lack of a filename/directory name at
the end of the command:
emtellipro-db-client \
--database mysql+pymysql://username:password@localhost:3306/example_db \
process \
--access-key ACCESS_KEY \
--secret-key SECRET_KEY \
--sql-query "SELECT report_id AS id, report AS text FROM reports WHERE category='Radiology' and subcategory='CT' ORDER BY report_id ASC LIMIT 1" 'mysql+pymysql://username:password@localhost:3306/reports_db' \
--server https://api2.emtelligent.com:50001 \
--store-reports \
--store-sections-and-sentences \
--document-type plain \
--category Radiology \
--subcategory CT \
--features snomed-ontology,entity-polarity,entity-uncertainty,followup-relations,measurement-relations,text \
--job-id ct_single_test
Note
If you’re using PostgreSQL as your target database, you can use the
--bulk-insert
option for the process
command, which will use
batching and Postgres-specific SQL extensions for faster insertion
time. You can use --bulk-insert
with other database types as
well such as MySQL and Snowflake. By default, using this option means
that the database client will batch insert 50 reports at a time into
the database.
Note
By default, the database client will submit as many documents per
batch as possible without exceeding the maximum request size limit of
the emtelliPro server. You can adjust the number of documents
sent in each batch to the emtelliPro server by using the
--max-submit-shard-size option for the process command. By
choosing a smaller number (e.g. a shard size of 50 matches the
--bulk-insert batch size explained above), the client will start
inserting into the database as soon as each batch has been processed
by the emtelliPro server.
If this is successful, we should see client output like the following:
Submitting 1 documents...
Waiting for processing results [####################################] 100%
1 documents saved to database
Processed 1 reports in 14.29s for an average per report processing time of 14.292181s per report
Saved documents to database in .54s (.54s / document)
Ingesting PDFs from source database
If your source database contains PDFs instead of plaintext documents, the
emtellipro-db-client can still process them, but you'll need to pass the
--text-is-pdf option to let it know the text column now contains PDF
bytes instead of plaintext.
The SQL query will look the same as above (note that the PDF column in the
source database is still renamed to text):
SELECT report_id AS id, report_pdf AS text FROM reports WHERE category='Radiology' and subcategory='CT' ORDER BY report_id ASC LIMIT 1
And the processing command is similar to the one above, but this time it
contains the --text-is-pdf option:
emtellipro-db-client \
--database mysql+pymysql://username:password@localhost:3306/example_db \
process \
--access-key ACCESS_KEY \
--secret-key SECRET_KEY \
--sql-query "SELECT report_id AS id, report_pdf AS text FROM reports WHERE category='Radiology' and subcategory='CT' ORDER BY report_id ASC LIMIT 1" 'mysql+pymysql://username:password@localhost:3306/reports_db' \
--text-is-pdf \
--server https://api2.emtelligent.com:50001 \
--store-reports \
--store-sections-and-sentences \
--document-type plain \
--category Radiology \
--subcategory CT \
--features snomed-ontology,entity-polarity,entity-uncertainty,followup-relations,measurement-relations,text \
--job-id ct_single_pdf_test
For this to work, the source database’s PDF column must contain unencoded PDF byte data.
Data types for storing PDFs in databases
The emtelliPro Database client supports ingestion of PDFs as byte data from report databases.
All SQL DBMSs provide data types for storing binary data of variable length. The data type you select to store PDF reports will depend on your system's requirements: the maximum size of the PDF reports that must be stored, along with storage and other database performance considerations. Note, however, that the default maximum size of a report that can be processed by emtelliPro is 500 MB, although this can be changed upon request.
The following table shows the commonly used variable-length binary data types available in PostgreSQL, MySQL, and Microsoft SQL Server. These data types have been tested with the emtelliPro Python database client to ensure that it can retrieve PDF documents stored using them.
DBMS                 | Data Type      | Max PDF Size (approx)
---------------------+----------------+----------------------
PostgreSQL           | bytea          | 1 GB
MySQL                | BLOB           | 64 KB
MySQL                | MEDIUMBLOB     | 16 MB
MySQL                | LONGBLOB       | 4 GB
Microsoft SQL Server | varbinary(max) | 2 GB
Exact storage requirements for the data types above are database engine-dependent. Refer to the documentation for the specific DBMS engine that you are using.
In addition to BLOB, MySQL supports the VARBINARY data type. VARBINARY is a variable-length binary type that is essentially equivalent to BLOB with respect to maximum storage size and SQL function support. However, VARBINARY columns are stored in-row, so their maximum storage limit of 65,535 bytes is shared with all other columns in the row, and the effective storage limit may therefore be less than 65,535 bytes. For this reason, BLOB may be preferred over VARBINARY.
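As a minimal sketch only, assuming a hypothetical MySQL source table named reports (column names mirror the examples on this page), a LONGBLOB column for storing raw PDF bytes could be defined with SQLAlchemy like this:

from sqlalchemy import Column, Integer, MetaData, String, Table
from sqlalchemy.dialects.mysql import LONGBLOB

metadata = MetaData()

# Hypothetical source table for illustration; adapt names to your own schema.
reports = Table(
    "reports",
    metadata,
    Column("report_id", Integer, primary_key=True),
    Column("category", String(255)),
    Column("subcategory", String(255)),
    Column("report_pdf", LONGBLOB),  # raw, unencoded PDF bytes (up to ~4 GB)
)

# metadata.create_all(engine) would emit the corresponding CREATE TABLE statement.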
Ingesting Input Documents from JSON Files
An alternative to reading documents from a database, or from plaintext or PDF files, is to
read them from a JSON or JSONL file. The benefit of this is that you can store additional
document metadata from the JSON/JSONL file in the emtelliPro database's documentmetadata table.
The JSON file must contain a list of objects, and each object must contain at
least an "id" and a "text" key. A category and subcategory may also be
specified in the file; if so, the engine will use them when processing the document.
If these fields are omitted, they must be specified as command-line options when running the client.
If a category and subcategory exist in the JSON/JSONL file and are also specified on the command line,
the values in the file will be used by the engine for processing and the
command-line options will be ignored. Any extra keys that share a name with columns in the
documentmetadata table will have their values stored in that table. Extra keys that are not known
columns in the documentmetadata table will be ignored.
Please note, however, that the use of additional document metadata keys is completely optional. A minimal JSON/JSONL file
that can still be processed would contain just an id and a text key with their associated values; in this case,
you would have to specify the category and subcategory on the emtellipro-db-client command line at runtime.
The JSON file(s) can use either UNIX (\n) or DOS (\r\n) line endings.
The following is a sample JSON file which also specifies a number of the optional fields from the
documentmetadata table:
[
{
"id": "18",
"source_document_id": "1000011",
"category": "Cardiology",
"subcategory": "ECG",
"original_category": "Cardiology",
"original_subcategory": "ECG",
"description": "ECG",
"text": "Atrial fibrillation with RVR. Left axis. Non-specific ST-T wave abnormalities.",
"subject_id": "11110011",
"institution": "City Hospital",
"hadm_id": 110001121,
"chartdate": "2022-02-16 20:48:09",
"author_name": "Russell Sano",
"subject_name": "Chioma Okonkwo",
"subject_dob": "1967-01-05 03:24:04",
"subject_gender": "female",
"requestor": "Eun-Young Abategiovanni"
}
]
Note
If you’d like to store PDFs in the JSON file instead of plaintext, you
should replace the text
key with a pdf
key whose value is a
Base64-encoded PDF. If the text
key remains in the JSON the client will
treat it as a regular JSON containing a plaintext report. You can find a
sample of such a JSON file in example-data/sample_ct_pdf_in_json.json
.
The PDF within it is example-data/sample_ct_imagebased_report.pdf
.
Assuming your file is stored as document.json and you want to store the output in a
PostgreSQL database, the processing command could look something like:
emtellipro-db-client \
--database postgresql://username:[email protected]:5432/clinic_notes \
process \
--access-key ACCESS_KEY \
--secret-key SECRET_KEY \
--server https://api.emtelligent.com:50001 \
--fast-postgres \
--store-reports \
--store-sections-and-sentences \
./document.json
And a query on the corresponding PostgreSQL documentmetadata table from the emtelliPro
database would contain the following information:
id | document_id | source_document_id | original_category | original_subcategory | institution | subject_id | description | hadm_id | chartdate | author_name | subject_name | subject_dob | subject_gender | requestor
----+-------------+--------------------+-------------------+----------------------+---------------+------------+-------------+-----------+---------------------+--------------+----------------+---------------------+----------------+-------------------------
18 | 18 | 1000011 | Cardiology | ECG | City Hospital | 11110011 | ECG | 110001121 | 2022-02-16 20:48:09 | Russell Sano | Chioma Okonkwo | 1967-01-05 03:24:04 | female | Eun-Young Abategiovanni
(1 row)
An alternative to JSON is the JSONL format, which stores each document object on a separate line:
{"id": "1", "text": "Document 1 text"}
{"id": "2", "text": "Second document text"}
Note that when using this format, the file extension must be .jsonl.
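As a minimal sketch, such a file could be produced like this (the output file name is illustrative):

import json

documents = [
    {"id": "1", "text": "Document 1 text"},
    {"id": "2", "text": "Second document text"},
]

# One JSON object per line; the file extension must be .jsonl.
with open("documents.jsonl", "w") as f:
    for doc in documents:
        f.write(json.dumps(doc) + "\n")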
Ingesting Input Documents from CCD Files
An HL7 CCD file contains both human-readable text elements and XML elements which provide structured data. The client extracts the text to pass through emtelliPro, and metadata from the structured data elements.
Note
Processing CCD files was previously handled by a plugin, which has now been merged into the SDK.
CCD files are expected to have the file extension .ccd.xml, .xml, or .ccd.
For these files the client will handle extraction of text and structured
metadata from the file. See CCD Features for more details.
Storing Extended Report Metadata
If you’re building an app, let’s say where you want to search all your
radiology reports for those where a pulmonary embolism was present,
you’ll want to be able to just search for the reports by modality
code (e.g. CT
), or the report description (e.g. CTA Chest
W/Contrast
). In order to do this without doing complex cross-schema or
cross-database joins, you’ll want to make use of the emtelliPro Database’s
documentmetadata
table. This table has the following structure
(in PostgreSQL):
Table "public.documentmetadata"
Column | Type | Collation | Nullable | Default
----------------------+-----------------------------+-----------+----------+----------------------------------------------
id | integer | | not null | nextval('documentmetadata_id_seq'::regclass)
document_id | integer | | |
source_document_id | character varying(255) | | |
original_category | character varying(255) | | |
original_subcategory | character varying(255) | | |
institution | character varying(255) | | |
subject_id | character varying(255) | | |
description | character varying(255) | | |
hadm_id | character varying(255) | | |
chartdate | timestamp without time zone | | |
author_name | character varying(255) | | |
subject_name | character varying(255) | | |
subject_dob | timestamp without time zone | | |
subject_gender | character varying(255) | | |
requestor | character varying(255) | | |
Indexes:
"documentmetadata_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
"documentmetadata_document_id_fkey" FOREIGN KEY (document_id) REFERENCES document(id)
If you are retrieving reports from a database, this metadata table allows you to store a number of additional metadata fields with your processed data to make app building and integration easier. An explanation of these fields follows below:
id - an auto-numbered index field
document_id - the foreign key link to the document.id field
source_document_id - intended for storing your own unique identifier, e.g. an accession number or study UUID
original_category - suggested (if you're in a multi-document-type environment) to be something like Radiology or Pathology, etc.
original_subcategory - suggested to represent the sub-category of your report type; in Pathology it might be cytopath, and in Radiology it might be the 0008,0061 (modalities in study) field
institution - can be null in a single-institution environment, or can hold a hospital name code in a multi-institution environment
subject_id - intended to store a unique patient identifier/MRN/PID
description - intended to store the study description, e.g. Urine Cytopathology or CT Lumbar Spine
hadm_id - intended to store a hospital admission ID
chartdate - intended to store a datetime from the original study
author_name - intended to store the author's name
subject_name - intended to store the subject's name
subject_dob - intended to store the subject's date of birth
subject_gender - intended to store the subject's gender
requestor - intended to store the document's requestor
Note
If storing a value for chartdate or subject_dob in an SQLite
database, it will first be parsed using dateutil,
because the sqlite driver requires it. For all other databases, the value
being inserted will have the same format as the one coming from the source
database.
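For example, a value such as the chartdate in the sample JSON above would be parsed roughly like this (a minimal sketch of the dateutil call, not the client's actual code):

from dateutil import parser

# "2022-02-16 20:48:09" is the chartdate from the sample JSON document above.
chartdate = parser.parse("2022-02-16 20:48:09")
print(chartdate)  # prints 2022-02-16 20:48:09 as a datetime object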
So, a SQL query from our source reports_db database that would store
data in these fields would be:
SELECT
report_id AS id,
report_id AS source_document_id,
category AS original_category,
subcategory AS original_subcategory,
institution,
subject_id,
description,
hadm_id,
chartdate,
author_name,
subject_name,
subject_dob,
subject_gender,
requestor,
report AS text
FROM reports
WHERE category='Radiology' AND subcategory='CT'
ORDER BY report_id ASC
And if we were going to use this same query in a processing command for the DB client, it would look like:
emtellipro-db-client \
--database postgresql://username:password@localhost:5432/ct_reports \
process \
--access-key ACCESS_KEY \
--secret-key SECRET_KEY \
--sql-query "SELECT report_id AS id, report_id AS source_document_id, category AS original_category, subcategory AS original_subcategory, institution, subject_id, description, hadm_id, chartdate, author_name, subject_name, subject_dob, subject_gender, requestor, report AS text FROM reports WHERE category='Radiology' AND subcategory='CT' ORDER BY report_id ASC" 'mysql+pymysql://username:password@localhost:3306/reports_db' \
--server https://api2.emtelligent.com:50001 \
--fast-postgres \
--store-reports \
--store-sections-and-sentences \
--document-type plain \
--category Radiology \
--subcategory CT \
--features snomed-ontology,entity-polarity,entity-uncertainty,followup-relations,measurement-relations,experiencer-relations,text \
--job-id ct_reports
And if we were going to query the documentmetadata table to look at
how the data from our source document database was stored, it would look
something like:
id | document_id | source_document_id | original_category | original_subcategory | institution | subject_id | description | hadm_id | chartdate | author_name | subject_name | subject_dob | subject_gender | requestor
----+-------------+--------------------+-------------------+----------------------+-------------+------------+----------------------------------------+-----------+---------------------+-------------+--------------+-------------+----------------+-----------
1 | 1 | 111111111 | Radiology | CT | city | 231523354 | CTA CHEST W&W/O C&RECONS, NON-CORONARY | AZ928913 | 2019-04-10 00:00:00 | | | | |
2 | 2 | 111111112 | Radiology | CT | city | 231523354 | CT PELVIS W/O CONTRAST | AZ928913 | 2019-04-22 00:00:00 | | | | |
(2 rows)
Now we could build an application that looks for diseases only in CT reports, or only in reports from a certain date range or with a certain description.
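As a minimal sketch of that kind of query (using SQLAlchemy and the example connection details from above, not the client itself), filtering documentmetadata by category, subcategory, and date range might look like:

from sqlalchemy import create_engine, text

# Example target database URL from the command above; adjust to your own.
engine = create_engine("postgresql://username:password@localhost:5432/ct_reports")

query = text("""
    SELECT document_id, source_document_id, description, chartdate
    FROM documentmetadata
    WHERE original_category = 'Radiology'
      AND original_subcategory = 'CT'
      AND chartdate BETWEEN :start AND :end
""")

with engine.connect() as conn:
    rows = conn.execute(query, {"start": "2019-04-01", "end": "2019-04-30"}).all()

for row in rows:
    print(row.document_id, row.description, row.chartdate)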
Next steps
Assuming that you started at Installation, let's recap what we've done in the Basic Client example and on this page:
Installed the package & tested that it’s set up properly.
Used the emtellipro-client Basic Client to test connectivity with the API and validate our access keys
Used the emtellipro-client Basic Client to submit a single test file for processing
Used the emtellipro-db-client Database Client to build database tables
Used the emtellipro-db-client Database Client to process a test file and store the output into a database, or retrieve a document from a database, process it, and store the output in a database
To get started building your own client, your next step is to visit the
Building Your Own Client page for Python code examples using the Emtellipro
library.