5.4 KiB
email_to_xml
Version History
2025-01-03: Initial release
TODO
- Implement OAUTH2 authentication for GMail / Office365 / etc.
- Optionally support multiple CSV formats per client
What is it?
A relatively simple script designed to grab emailed CSV data from a mailbox, and convert it to validated clean XML for import into another system.
How does it work?
The getmail.py script is designed to poll an IMAP mailbox when run.
It will find any emails with attachments, marking them as read, and saving the attachment to the 'attachments' directory for later processing.
Attachments are output named with the current date-time, and a semi-random uid based on the file hash in order to prevent namespace collisions.
Attachments are expected to be CSV format, but can be in various field orders, though a header row is required.
No file locking is used, however files are written to a temporary directory first, flushed, and renamed upon completion, as renaming is an atomic operation at an OS level.
Attachments will not be created if they have an identical hash to a previously downloaded attachment. This is designed to prevent scenarios where the same file has been accidentally sent multiple times. Note that this identification is done based on file content, and the name of the file is irrelevant.
Once an attachment is saved, we look up the mapping of fields in the attachment for input, using a custom mapping based on the domain of the email address of the sender. Different companies may use differing CSV formats, but we work on the assumption that the XML will need to be the same and based on a well defined DTD.
After mapping and conversion to XML, the file is validated against the DTD and written to the xml directory for consumption by the target software.
Logging is fairly primitive and done to a log file in the same directory as the script. This could be upgraded to syslog-style logging if required.
This is set up for simple IMAP SSL authentication using TLS with implied STARTTLS. If manual STARTTLS is required, the MailBox method will need to be altered to MailBoxTls. If Outlook or Gmail or similar are used, it will be necessary to implement OAUTH2.
Authentication and other user configuration is configured in a .env file as described below.
How is it configured?
Here is an example .env file:
MBOX_USER = 'testmail'
MBOX_PASS = 'supersecretpassword'
MAIL_HOST = 'imap.some.server.com'
MAIL_PORT = 993
DTD = 'items.dtd'
Note that the DTD file should be provided as a name only but is expected to be
found in the xml subdirectory.
When run for the first time (or more precisely, when the first attachment is downloaded) a saved_hashes.json file will be created in this script's directory. This file should be backed up if the attachment history is important as it is this file which prevents duplicate attachments being downloaded and processed.
The script also expects to find a column_mapping.json file in it's directory containing CSV columns to XML fields. Unused columns in a CSV can simply be left out and they will be ignored. The title of each json object in this file should match the domain of the sender of the email containing the CSV attachment. The "default" object will be used if no specific match is found.
Here is an example column_mapping.json file:
{
"default": {
"Item_ID": "Item_ID",
"Item_Name": "Item_Name",
"Item_Description": "Item_Description",
"Item_Price": "Item_Price",
"Item_Quantity": "Item_Quantity"
},
"hamiltron.net": {
"Item_ID": "ID",
"Item_Name": "Name",
"Item_Description": "Description",
"Item_Price": "Price",
"Item_Quantity": "Quantity"
}
}
The script will also create the directories temp (used for temporary file
processing) and attachments where CSV files are stored. If it is not
important to keep attachments, the latter folder can be ignored, purged, or
even removed entirely - it will be recreated on next run to no ill effect.
Processed XML files ready for final consumption live in the xml directory
along with the DTD. This directory is also unimportant for backup purposes
depending on your requirements for xml files post-consumption. That said, The
DTD file MUST exist there.
Installation
Because this script uses a python virtual environment, and is designed to run
from cron it should be installed like so (it is assumed there is a
working python3 and pip3 on the system).
- Perform the following steps in your home directory (or some other location where you have filesystem permissions):
git clone https://git.hamiltron.net/greig/email_to_xml.git
cd email_to_xml
pip3 install virtualenv
virtualenv email_to_xml
mkdir xml
touch .env
- Copy your verified DTD file into the xml directory.
- Edit .env and configure the correct values for your setup.
- Create your self-signed certificate and key as follows:
openssl req -x509 -newkey rsa:4096 -keyout one.key -out one.crt -sha256 -days 3650 -nodes -subj "/C=XX/ST=StateName/L=CityName/O=CompanyName/OU=CompanySectionName/CN=CommonNameOrHostname"
If you wish to run the script interactively from the commandline, you will
first need to activate the virtual environment:
. ./email_to_xml/bin/activate
You may now run ./getmail.py - you can see the result of the run in
./getmail.log.
Here is an example crontab line to run the script every 5 minutes:
*/5 * * * * cd ~/email_to_xml && ./email_to_xml/bin/python3 ~/email_to_xml/getmail.py