Database Client

Here we show how to use the provided emtellipro-db-client database client to:

  1. Build a set of database tables used to store the emtelliPro output

  2. Submit a document for processing (either a file from disk or a document retrieved from a database) and store the output in the tables built in step 1

The goal of the database client is to allow users to pull documents from a database (or, like the basic client, process files from disk) and then store emtelliPro’s structured output in a database. Once the data is stored, you can focus on app development or on writing queries to answer the questions you have about your data.

For full documentation of the database client’s configuration options, see Database Client Options.

Installing the Database Client

The advanced database client uses the SQLAlchemy ORM to work with databases, plus an additional connector module for the RDBMS you’re using. Its tables represent all the information available in the AnnotatedDocument instance. In principle, this client should support any database that SQLAlchemy can connect to.

This client is installed automatically when you install the emtellipro wheel file. If you’ve already followed the Installation instructions, you’re nearly there. Your next step is to install the relevant database connector:

pip install pymysql        # for MySQL
pip install psycopg2       # for PostgreSQL

Note

You may see an error about SQLite extensions not being built. If you’re not using SQLite, this will not be an issue, but if you are, you’ll want to install the libsqlite3-dev package (example shown for Debian/Ubuntu below):

sudo apt install libsqlite3-dev

If you’ll be using a MySQL database, there are some extra dependencies you’ll need (example shown for Debian/Ubuntu below):

sudo apt install libssl-dev

Preparing the database for storing emtelliPro output

Because the Database Client stores the output from emtelliPro in a set of database tables, you first need to create the database in your RDBMS and then use the client itself to create the tables that will store the data.

To create a database called example_db in MySQL (the command is the same in PostgreSQL), log into your database server’s console and run the following command:

CREATE DATABASE example_db;

Next, we’ll use the Database Client to create the tables inside the example_db database that will store the emtelliPro output. To tell the Database Client which database to use, we pass a URL like one of the following:

  • sqlite:///example.db will use example.db in the current directory

  • postgresql://postgres:your_password@localhost:5432/example_db will connect to a PostgreSQL server running on the default TCP port (5432) on localhost as the postgres user with the password your_password, to the database example_db

  • mysql+pymysql://username:your_password@host:port/example_db will connect to a MySQL server running on host host on TCP port port using username username and password your_password to the database/schema example_db

Work out what your URL should be for your database before moving on to the next step. As a general reminder, if you’re getting connectivity errors, remember that the RDBMS has to be listening on a port, any firewall needs to permit the traffic, and the database user may need to be enabled for remote connections. A quick way to sanity-check your URL is sketched below.
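
For example, here is a minimal check using SQLAlchemy (which the client itself uses; this sketch is not part of the client), using the PostgreSQL URL from the list above:

from sqlalchemy import create_engine, text

# Replace with the URL you worked out above.
DB_URL = "postgresql://postgres:your_password@localhost:5432/example_db"

engine = create_engine(DB_URL)
try:
    with engine.connect() as conn:
        conn.execute(text("SELECT 1"))
    print("Connection OK")
except Exception as exc:
    print(f"Connection failed: {exc}")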

Now let’s create the tables inside our example_db database. To use the Database Client to create the tables for SQLite, run:

emtellipro-db-client --database sqlite:///example.db create-db

And to use it to create the database tables for MySQL, run the following (but modify it for your environment):

emtellipro-db-client --database mysql+pymysql://username:password@hostname:port/example_db create-db

And to use it to create the database tables for PostgreSQL, run the following (but modify it for your environment):

emtellipro-db-client --database postgresql://username:password@hostname:port/example_db create-db
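
If you’d like to verify that create-db produced the expected tables, a minimal sketch using SQLAlchemy’s inspector (shown here against the SQLite example) is:

from sqlalchemy import create_engine, inspect

engine = create_engine("sqlite:///example.db")

# Lists the tables created by create-db; expect to see names such as
# document and documentmetadata among them.
print(inspect(engine).get_table_names())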

Upgrading the database

If you have a previous version of the database, created using an older version of the database client (going back to v4.3.0), you can upgrade to the latest version using the migrate subcommand, as follows:

emtellipro-db-client --database postgresql://username:password@hostname:port/example_db migrate

Warning

Please ensure you’ve backed up the data before running this command. The migrate command will copy data over if columns change, and all changes will execute within a transaction (which should roll back if anything goes wrong), but it’s recommended to have a backup just in case.

Input Sources

Next we have to decide what our input source will be, i.e. where we’re going to get the reports from. The Database Client supports two ingestion modes:

  1. Reading input documents as .txt, .pdf, or .json (or .jsonl) files from disk

  2. Reading input documents from a database

These use different command-line options which are explained below.

Note

For a complete list of emtelliPro client commands, please review the command-line help. For available report submission options, run the following commands:

emtellipro-db-client process --help
emtellipro-client submit --help

Processing Text or PDF Files from Disk

The final line of the command block we use to run the Database Client is the name of the file or directory we want to process. Note the following:

  • The --store-reports option is used to store the raw report in the database’s document table. If this option is omitted, the source report is not stored; however, if the input type is PDF, the report text returned by emtelliPro is always stored.

  • The --job-id option is used to store a unique identifier for the job in the document table of the database. If --job-id isn’t specified, a unique identifier will be randomly generated.

Here’s the syntax for the command:

emtellipro-db-client \
    --database sqlite:///example.db \
    process \
    --access-key ACCESS_KEY \
    --secret-key SECRET_KEY \
    --server https://api.emtelligent.com:50001 \
    --store-reports \
    --store-sections-and-sentences \
    --category DOC_CATEGORY \
    --subcategory DOC_SUBCATEGORY \
    --features FEATURE1,FEATURE2,FEATURE3.. \
    --job-id test_run_1 \
    ./path/to/document*.txt

If your test is successful, you’ll see output that looks like this:

Connecting as USERNAME
Submitting 10000 documents for processing...
Waiting for processing results...
100%|█████████████████████████████| 10000/10000 [32:18<00:00,  5.16doc/s]

Processed 10000 reports in 32:18.39s for an average per report processing
time of .193839s per report

Processing Text or PDF Documents from a Database

In this use case, we retrieve CT scan reports stored in a database as plain text or as PDFs (binary data), submit them to emtelliPro for processing, and store the emtelliPro output in the tables we previously created in the example_db database.

So before we create a command for processing, we need to decide on two things:

  1. The URL of the database for the source reports

  2. The SQL query string that we’re going to use for retrieving the reports from that database

Ingesting Text from source database

For this example, we assume that the database containing the source documents is a MySQL database named reports_db. This database runs on the same database server and uses the same user permissions as the destination database: example_db.

Hence, the URL for the source database (reports_db) from which we want to retrieve reports is:

mysql+pymysql://username:password@localhost:3306/reports_db

And the SQL query we’ll use to retrieve reports is:

SELECT report_id AS id, report AS text FROM reports WHERE category='Radiology' and subcategory='CT' ORDER BY report_id ASC LIMIT 1

Note

The two fields emtelliPro requires are id and text, hence our use of SQL’s SELECT ... AS ... to rename the source columns to those required names. Also note that the WHERE clause contains single quotes, which is why the SQL statement must be wrapped in double quotes on the command line.
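
If you want to sanity-check that the query returns exactly the columns the client expects, a quick sketch (assuming SQLAlchemy and the source URL above) is:

from sqlalchemy import create_engine, text

SOURCE_URL = "mysql+pymysql://username:password@localhost:3306/reports_db"
QUERY = ("SELECT report_id AS id, report AS text FROM reports "
         "WHERE category='Radiology' and subcategory='CT' "
         "ORDER BY report_id ASC LIMIT 1")

engine = create_engine(SOURCE_URL)
with engine.connect() as conn:
    # Assumes the query matches at least one row.
    row = conn.execute(text(QUERY)).mappings().first()
print(sorted(row.keys()))  # should print ['id', 'text']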

Putting our source URL and SQL statement together, here is the final command block for the Database Client. Note the use of the --sql-query option and the absence of a filename or directory name at the end of the command:

emtellipro-db-client \
    --database mysql://username:password@localhost:3306/example_db  \
    process \
    --access-key ACCESS_KEY \
    --secret-key SECRET_KEY  \
    --sql-query "SELECT report_id AS id, report AS text FROM reports WHERE category='Radiology' and subcategory='CT' ORDER BY report_id ASC LIMIT 1" 'mysql+pymysql://username:password@localhost:3306/reports_db' \
    --server https://api2.emtelligent.com:50001 \
    --store-reports \
    --store-sections-and-sentences \
    --document-type plain \
    --category Radiology \
    --subcategory CT \
    --features snomed-ontology,entity-polarity,entity-uncertainty,followup-relations,measurement-relations,text \
    --job-id ct_single_test

Note

If you’re using PostgreSQL as your target database, you can pass the --bulk-insert option to the process command, which uses batching and Postgres-specific SQL extensions for faster insertion. --bulk-insert also works with other database types such as MySQL and Snowflake. By default, this option makes the database client insert reports into the database in batches of 50.

Note

By default, the database client submits as many documents as possible without exceeding the emtelliPro server’s maximum request size. You can adjust the number of documents sent in each batch with the --max-submit-shard-size option of the process command. Choosing a smaller number (e.g. a shard size of 50 to match the --bulk-insert batch size explained above) lets the client begin inserting into the database as soon as each batch has been processed by the emtelliPro server.

If this is successful, we should see client output like the following:

Submitting 1 documents...
Waiting for processing results  [####################################]  100%
1 documents saved to database
Processed 1 reports in 14.29s for an average per report processing time of 14.292181s per report
Saved documents to database in .54s (.54s / document)

Ingesting PDFs from source database

If your source database contains PDFs instead of plaintext documents, the emtellipro-db-client can still process them, but you’ll need to pass the --text-is-pdf option to let it know that the text column contains PDF bytes rather than plaintext.

The SQL query will look the same as above (note that the PDF column in the source database is still renamed to text):

SELECT report_id AS id, report_pdf AS text FROM reports WHERE category='Radiology' and subcategory='CT' ORDER BY report_id ASC LIMIT 1

And the processing command is similar to the one above, but this time it contains the --text-is-pdf option:

emtellipro-db-client \
    --database mysql://username:password@localhost:3306/example_db  \
    process \
    --access-key ACCESS_KEY \
    --secret-key SECRET_KEY  \
    --sql-query "SELECT report_id AS id, report AS text FROM reports WHERE category='Radiology' and subcategory='CT' ORDER BY report_id ASC LIMIT 1" 'mysql+pymysql://username:password@localhost:3306/reports_db' \
    --text-is-pdf \
    --server https://api2.emtelligent.com:50001 \
    --store-reports \
    --store-sections-and-sentences \
    --document-type plain \
    --category Radiology \
    --subcategory CT \
    --features snomed-ontology,entity-polarity,entity-uncertainty,followup-relations,measurement-relations,text \
    --job-id ct_single_pdf_test

For this to work, the source database’s PDF column must contain unencoded PDF byte data.
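
One quick way to confirm this (a sketch, reusing the source URL and the report_pdf column from the query above) is to check that a stored value starts with the standard %PDF header, which is present in raw PDF bytes but not in Base64-encoded data:

from sqlalchemy import create_engine, text

engine = create_engine("mysql+pymysql://username:password@localhost:3306/reports_db")
with engine.connect() as conn:
    pdf_bytes = conn.execute(text("SELECT report_pdf FROM reports LIMIT 1")).scalar()

# Raw PDF data begins with the b'%PDF' magic bytes.
print(isinstance(pdf_bytes, (bytes, bytearray)) and pdf_bytes[:4] == b"%PDF")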

Data types for storing PDFs in databases

The emtelliPro Database client supports ingestion of PDFs as byte data from report databases.

All SQL DBMSs provide data types for storing variable-length binary data. The data type you select for storing PDF reports will depend on your system’s requirements: the maximum size of the PDFs to be stored, plus storage and other database performance considerations. Note also that the default maximum size of a report that emtelliPro can process is 500 MB, although this can be changed upon request.

The following table shows the commonly used variable-length binary data types available in PostgreSQL, MySQL, and Microsoft SQL Server. These data types have been tested with the emtelliPro Python database client to ensure that it can retrieve PDF documents stored using them.

 DBMS          | Data Type      | Max PDF Size (approx.)
---------------+----------------+------------------------
 PostgreSQL    | bytea          | 1 GB
 MySQL         | BLOB           | 64 KB
 MySQL         | MEDIUMBLOB     | 16 MB
 MySQL         | LONGBLOB       | 4 GB
 MS SQL Server | varbinary(max) | 2 GB

Exact storage requirements for the data types above are database engine-dependent. Refer to the documentation for the specific DBMS engine you are using.

In addition to BLOB, MySQL supports the VARBINARY data type. VARBINARY is a variable-length binary type that is essentially equivalent to BLOB with respect to maximum storage size and SQL function support. However, VARBINARY columns store their data in the row itself, so their maximum storage limit of 65,535 bytes is shared with all of the row’s columns, and the effective limit may therefore be less than 65,535 bytes. For this reason, BLOB may be preferred over VARBINARY.
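
If you’re creating a source table for PDF reports from scratch, a minimal sketch using SQLAlchemy is shown below; the table and column names are illustrative, and LargeBinary maps to bytea on PostgreSQL and to a BLOB type on MySQL (for very large PDFs on MySQL you may want to declare the column with sqlalchemy.dialects.mysql.LONGBLOB explicitly):

from sqlalchemy import (
    Column, Integer, LargeBinary, MetaData, String, Table, create_engine,
)

metadata = MetaData()

# Hypothetical source table holding PDF reports as raw bytes.
reports = Table(
    "reports",
    metadata,
    Column("report_id", Integer, primary_key=True),
    Column("category", String(255)),
    Column("subcategory", String(255)),
    Column("report_pdf", LargeBinary),
)

engine = create_engine("mysql+pymysql://username:password@localhost:3306/reports_db")
metadata.create_all(engine)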

Ingesting Input Documents from JSON Files

An alternative to reading documents from a database or from plaintext or PDF files is to read them from a JSON or JSONL file. The benefit of this is that additional document metadata from the JSON/JSONL file can be stored in the emtelliPro database’s documentmetadata table.

The JSON file must contain a list of objects, and each object must contain at least an “id” and a “text” key. A category and subcategory may also be specified in the file; if so, the engine uses them when processing the document. If these fields are omitted from the file, they must be specified as command-line options when running the client. If a category and subcategory appear both in the JSON/JSONL file and on the command line, the values from the file are used by the engine and the command-line options are ignored. Any extra keys whose names match columns in the documentmetadata table have their values stored in that table; extra keys that are not known documentmetadata columns are ignored.

Note, however, that the additional document metadata keys are completely optional. A minimal JSON/JSONL file that can still be processed contains just an id and a text key with their associated values; in this case, you must specify the category and subcategory on the emtellipro-db-client command line at runtime.

The JSON file(s) can use either UNIX (\n) or DOS (\r\n) line endings.

The following is a sample JSON file which also specifies a number of the optional fields from the documentmetadata table:

[
  {
    "id": "18",
    "source_document_id": "1000011",
    "category": "Cardiology",
    "subcategory": "ECG",
    "original_category": "Cardiology",
    "original_subcategory": "ECG",
    "description": "ECG",
    "text": "Atrial fibrillation with RVR.  Left axis.  Non-specific ST-T wave abnormalities.",
    "subject_id": "11110011",
    "institution": "City Hospital",
    "hadm_id": 110001121,
    "chartdate": "2022-02-16 20:48:09",
    "author_name": "Russell Sano",
    "subject_name": "Chioma Okonkwo",
    "subject_dob": "1967-01-05 03:24:04",
    "subject_gender": "female",
    "requestor": "Eun-Young Abategiovanni"
  }
]
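
If you generate such files programmatically, a quick sanity check of the required keys (a sketch using only the standard library; the filename is illustrative) might be:

import json

with open("document.json") as fh:
    documents = json.load(fh)

# Each object needs at least "id" plus "text" (or "pdf" for Base64-encoded
# PDFs; see the note below).
for doc in documents:
    assert "id" in doc, "every document needs an id"
    assert "text" in doc or "pdf" in doc, "every document needs text or pdf"
print(f"{len(documents)} documents look OK")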

Note

If you’d like to store PDFs in the JSON file instead of plaintext, replace the text key with a pdf key whose value is a Base64-encoded PDF. If the text key remains in the JSON, the client will treat it as a regular JSON file containing a plaintext report. You can find a sample of such a JSON file in example-data/sample_ct_pdf_in_json.json. The PDF within it is example-data/sample_ct_imagebased_report.pdf.
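
A sketch of building such a file yourself (the pdf key follows the note above; the input path is the sample PDF mentioned there and the output filename is illustrative):

import base64
import json

with open("example-data/sample_ct_imagebased_report.pdf", "rb") as fh:
    pdf_b64 = base64.b64encode(fh.read()).decode("ascii")

document = {
    "id": "19",
    "category": "Radiology",
    "subcategory": "CT",
    "pdf": pdf_b64,  # Base64-encoded PDF replaces the plaintext "text" key
}

with open("document_pdf.json", "w") as fh:
    json.dump([document], fh)  # the file must contain a list of objects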

Assuming your file is stored as document.json and you want to store the output in a PostgreSQL database, the processing command could look something like this:

emtellipro-db-client \
    --database postgresql://username:password@hostname:5432/clinic_notes \
    process \
    --access-key ACCESS_KEY \
    --secret-key SECRET_KEY \
    --server https://api.emtelligent.com:50001 \
    --fast-postgres \
    --store-reports \
    --store-sections-and-sentences \
    ./document.json

And a query on the corresponding PostgreSQL documentmetadata table from the emtelliPro database would contain the following information:

 id | document_id | source_document_id | original_category | original_subcategory |  institution  | subject_id | description |  hadm_id  |      chartdate      | author_name  |  subject_name  |     subject_dob     | subject_gender |        requestor
----+-------------+--------------------+-------------------+----------------------+---------------+------------+-------------+-----------+---------------------+--------------+----------------+---------------------+----------------+-------------------------
 18 |          18 | 1000011            | Cardiology        | ECG                  | City Hospital | 11110011   | ECG         | 110001121 | 2022-02-16 20:48:09 | Russell Sano | Chioma Okonkwo | 1967-01-05 03:24:04 | female         | Eun-Young Abategiovanni
(1 row)

An alternative to JSON is the JSONL format, which stores each document object on a separate line:

{"id": "1", "text": "Document 1 text"}
{"id": "2", "text": "Second document text"}

Note that when using this format, the file extension must be .jsonl.
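
If you’re producing JSONL from your own data, a minimal sketch (writing the two-document example above) is:

import json

docs = [
    {"id": "1", "text": "Document 1 text"},
    {"id": "2", "text": "Second document text"},
]

# One JSON object per line; the file extension must be .jsonl.
with open("documents.jsonl", "w") as fh:
    for doc in docs:
        fh.write(json.dumps(doc) + "\n")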

Ingesting Input Documents from CCD Files

An HL7 CCD file contains both human-readable text elements and XML elements that provide structured data. The client extracts the text to pass through emtelliPro, and extracts metadata from the structured data elements.

Note

Processing CCD files was previously handled by a plugin, which has now been merged into the SDK.

CCD files are expected to have file extensions .ccd.xml, .xml, or .ccd. For these files the client will handle extraction of text and structured metadata from the file. See CCD Features for more details.

Storing Extended Report Metadata

Suppose you’re building an app in which you want to search all your radiology reports for those where a pulmonary embolism was present. You’ll want to be able to filter the reports by modality code (e.g. CT) or by report description (e.g. CTA Chest W/Contrast). To do this without complex cross-schema or cross-database joins, make use of the emtelliPro database’s documentmetadata table. This table has the following structure (in PostgreSQL):

                                             Table "public.documentmetadata"
        Column        |            Type             | Collation | Nullable |                   Default
----------------------+-----------------------------+-----------+----------+----------------------------------------------
 id                   | integer                     |           | not null | nextval('documentmetadata_id_seq'::regclass)
 document_id          | integer                     |           |          |
 source_document_id   | character varying(255)      |           |          |
 original_category    | character varying(255)      |           |          |
 original_subcategory | character varying(255)      |           |          |
 institution          | character varying(255)      |           |          |
 subject_id           | character varying(255)      |           |          |
 description          | character varying(255)      |           |          |
 hadm_id              | character varying(255)      |           |          |
 chartdate            | timestamp without time zone |           |          |
 author_name          | character varying(255)      |           |          |
 subject_name         | character varying(255)      |           |          |
 subject_dob          | timestamp without time zone |           |          |
 subject_gender       | character varying(255)      |           |          |
 requestor            | character varying(255)      |           |          |
Indexes:
    "documentmetadata_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
    "documentmetadata_document_id_fkey" FOREIGN KEY (document_id) REFERENCES document(id)

If you are retrieving reports from a database, this metadata table lets you store a number of additional metadata fields alongside your processed data to make app building and integration easier. The fields are explained below:

  • id - this field is just an auto-numbered index field

  • document_id - this field is the foreign key link to the document.id field

  • source_document_id - this field is intended for storing your own unique identifier - e.g. an accession number or study UUID

  • original_category - this is suggested (if you’re in a multi-document-type environment) to be something like Radiology or Pathology, etc.

  • original_subcategory - this is suggested to represent the sub-category of your report type - in Pathology it might be cytopath, or in Radiology it might be the (0008,0061) Modalities in Study field

  • institution - this can be null if just in a single-institution environment or could represent a hospital name code if in a multi-institution environment

  • subject_id - this is intended to allow storage of a unique patient identifier/MRN/PID

  • description - this is intended to store the study description, e.g. Urine Cytopathology or CT Lumbar Spine

  • hadm_id - this is intended to store a hospital admission ID

  • chartdate - this is intended to store a datetime from the original study

  • author_name - this is intended to store the author’s name

  • subject_name - this is intended to store the subject’s name

  • subject_dob - this is intended to store the subject’s date of birth

  • subject_gender - this is intended to store the subject’s gender

  • requestor - this is intended to store the document’s requestor

Note

When storing a value for chartdate or subject_dob in an SQLite database, the value is first parsed using dateutil, because the SQLite driver requires it. For all other databases, the value is inserted in the same format as it comes from the source database.
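
For reference, dateutil accepts most common timestamp formats, so values like the following both parse to the same datetime (a minimal sketch):

from dateutil import parser

# Both print 2022-02-16 20:48:09
print(parser.parse("2022-02-16 20:48:09"))
print(parser.parse("16 Feb 2022 20:48:09"))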

So, an SQL query against our source reports_db database that would populate these fields is:

SELECT
    report_id AS id,
    report_id AS source_document_id,
    category AS original_category,
    subcategory AS original_subcategory,
    institution,
    subject_id,
    description,
    hadm_id,
    chartdate,
    author_name,
    subject_name,
    subject_dob,
    subject_gender,
    requestor,
    report AS text
FROM reports
WHERE category='Radiology' AND subcategory='CT'
ORDER BY report_id ASC

And if we were to use this same query in a processing command for the DB client, it would look like this:

emtellipro-db-client \
    --database postgresql://username:password@localhost:5432/ct_reports  \
    process \
    --access-key ACCESS_KEY \
    --secret-key SECRET_KEY  \
    --sql-query "SELECT report_id AS id, report_id AS source_document_id, category AS original_category, subcategory AS original_subcategory, institution, subject_id, description, hadm_id, chartdate, author_name, subject_name, subject_dob, subject_gender, requestor, report AS text FROM reports_db WHERE category='Radiology' AND subcategory='CT' ORDER BY report_id ASC" 'mysql+pymysql://username:password@localhost:3306/reports_db' \
    --server https://api2.emtelligent.com:50001 \
    --fast-postgres \
    --store-reports \
    --store-sections-and-sentences \
    --document-type plain \
    --category Radiology \
    --subcategory CT \
    --features snomed-ontology,entity-polarity,entity-uncertainty,followup-relations,measurement-relations,experiencer-relations,text \
    --job-id ct_reports

And if we query the documentmetadata table to see how the data from our source document database was stored, the result looks something like this:

 id | document_id | source_document_id | original_category | original_subcategory | institution | subject_id |              description               |  hadm_id  |      chartdate      | author_name | subject_name | subject_dob | subject_gender | requestor
----+-------------+--------------------+-------------------+----------------------+-------------+------------+----------------------------------------+-----------+---------------------+-------------+--------------+-------------+----------------+-----------
  1 |           1 |          111111111 | Radiology         | CT                   | city        | 231523354  | CTA CHEST W&W/O C&RECONS, NON-CORONARY | AZ928913  | 2019-04-10 00:00:00 |             |              |             |                |
  2 |           2 |          111111112 | Radiology         | CT                   | city        | 231523354  | CT PELVIS W/O CONTRAST                 | AZ928913  | 2019-04-22 00:00:00 |             |              |             |                |
(2 rows)

Now we can build an application that looks for diseases only in CT reports, or only in reports from a certain date range or with a certain description.
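
For example, a sketch of such a query against the destination database (using SQLAlchemy; the URL matches the command above and the filter values are illustrative):

from datetime import datetime

from sqlalchemy import create_engine, text

engine = create_engine("postgresql://username:password@localhost:5432/ct_reports")

query = text("""
    SELECT d.id, m.description, m.chartdate
    FROM document AS d
    JOIN documentmetadata AS m ON m.document_id = d.id
    WHERE m.original_subcategory = 'CT'
      AND m.chartdate BETWEEN :start AND :end
""")

with engine.connect() as conn:
    for row in conn.execute(query, {"start": datetime(2019, 4, 1), "end": datetime(2019, 4, 30)}):
        print(row.id, row.description, row.chartdate)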

Next steps

Assuming that you started at Installation, let’s recap what we’ve done in both the Basic Client example and on this page:

  1. Installed the package & tested that it’s set up properly.

  2. Used the emtellipro-client Basic Client to test connectivity with the API & validate our access keys

  3. Used the emtellipro-client Basic Client to submit a single test file for processing

  4. Used the emtellipro-db-client Database Client to build database tables

  5. Used the emtellipro-db-client Database Client to process a test file (or a document retrieved from a database) and store the output in a database

To get started building your own client, your next step is to visit the Building Your Own Client page for Python code examples using the Emtellipro library.