Files
email_to_xml/README.md
2025-01-03 06:52:46 +13:00

4.0 KiB

email_to_xml

What is it?

A relatively simple script designed to grab emailed CSV data from a mailbox, and convert it to validated clean XML for import into another system.

How does it work?

The getmail.py script is designed to poll an IMAP mailbox when run.

It will find any emails with attachments, marking them as read, and saving the attachment to the 'attachments' directory for later processing.

Attachments are output named with the current date-time, and a semi-random uid based on the file hash in order to prevent namespace collisions.

Attachments are expected to be CSV format, but can be in various field orders, though a header row is required.

No file locking is used, however files are written to a temporary directory first, flushed, and renamed upon completion, as renaming is an atomic operation at an OS level.

Attachments will not be created if they have an identical hash to a previously downloaded attachment. This is designed to prevent scenarios where the same file has been accidentally sent multiple times. Note that this identification is done based on file content, and the name of the file is irrelevant.

Once an attachment is saved, we look up the mapping of fields in the attachment for input, using a custom mapping based on the domain of the email address of the sender. Different companies may use differing CSV formats, but we work on the assumption that the XML will need to be the same and based on a well defined DTD.

After mapping and conversion to XML, the file is validated against the DTD and written to the xml directory for consumption by the target software.

Logging is fairly primitive and done to a log file in the same directory as the script. This could be upgraded to syslog-style logging if required.

This is set up for simple IMAP SSL authentication using TLS with implied STARTTLS. If manual STARTTLS is required, the MailBox method will need to be altered to MailBoxTls. If Outlook or Gmail or similar are used, it will be necessary to implement OAUTH2.

Authentication and other user configuration is configured in a .env file as described below.

How is it configured?

Here is an example .env file:

MBOX_USER = 'testmail'
MBOX_PASS = 'supersecretpassword'
MAIL_HOST = 'imap.some.server.com'
MAIL_PORT = 993
DTD = 'items.dtd'

Note that the DTD file should be provided as a name only but is expected to be found in the xml subdirectory.

When run for the first time (or more precisely, when the first attachment is downloaded) a saved_hashes.json file will be created in this script's directory. This file should be backed up if the attachment history is important as it is this file which prevents duplicate attachments being downloaded and processed.

The script also expects to find a column_mapping.json file in it's directory containing CSV columns to XML fields. Unused columns in a CSV can simply be left out and they will be ignored. The title of each json object in this file should match the domain of the sender of the email containing the CSV attachment. The "default" object will be used if no specific match is found.

Here is an example column_mapping.json file:

{
  "default": {
    "Item_ID": "Item_ID",
    "Item_Name": "Item_Name",
    "Item_Description": "Item_Description",
    "Item_Price": "Item_Price",
    "Item_Quantity": "Item_Quantity"
  },
  "hamiltron.net": {
    "Item_ID": "ID",
    "Item_Name": "Name",
    "Item_Description": "Description",
    "Item_Price": "Price",
    "Item_Quantity": "Quantity"
  }
}

The script will also create the directories temp (used for temporary file processing) and attachments where CSV files are stored. If it is not important to keep attachments, the latter folder can be ignored, purged, or even removed entirely - it will be recreated on next run to no ill effect.

Processed XML files ready for final consumption live in the xml directory along with the DTD. This directory is also unimportant for backup purposes depending on your requirements for xml files post-consumption. That said, The DTD file MUST exist there.