148 lines
5.4 KiB
Markdown
148 lines
5.4 KiB
Markdown
email_to_xml
|
|
|
|
## Version History
|
|
|
|
2025-01-03: Initial release
|
|
|
|
## TODO
|
|
|
|
* Implement OAUTH2 authentication for GMail / Office365 / etc.
|
|
* Optionally support multiple CSV formats per client
|
|
|
|
## What is it?
|
|
|
|
A relatively simple script designed to grab emailed CSV data from a mailbox,
|
|
and convert it to validated clean XML for import into another system.
|
|
|
|
## How does it work?
|
|
|
|
The ```getmail.py``` script is designed to poll an IMAP mailbox when run.
|
|
|
|
It will find any emails with attachments, marking them as read, and saving
|
|
the attachment to the 'attachments' directory for later processing.
|
|
|
|
Attachments are output named with the current date-time, and a semi-random
|
|
uid based on the file hash in order to prevent namespace collisions.
|
|
|
|
Attachments are expected to be CSV format, but can be in various field orders,
|
|
though a header row is required.
|
|
|
|
No file locking is used, however files are written to a temporary directory
|
|
first, flushed, and renamed upon completion, as renaming is an atomic operation
|
|
at an OS level.
|
|
|
|
Attachments will not be created if they have an identical hash to a
|
|
previously downloaded attachment. This is designed to prevent scenarios where
|
|
the same file has been accidentally sent multiple times. Note that this
|
|
identification is done based on file content, and the name of the file is
|
|
irrelevant.
|
|
|
|
Once an attachment is saved, we look up the mapping of fields in the attachment
|
|
for input, using a custom mapping based on the domain of the email address of
|
|
the sender. Different companies may use differing CSV formats, but we work on
|
|
the assumption that the XML will need to be the same and based on a well
|
|
defined DTD.
|
|
|
|
After mapping and conversion to XML, the file is validated against the DTD and
|
|
written to the xml directory for consumption by the target software.
|
|
|
|
Logging is fairly primitive and done to a log file in the same directory as
|
|
the script. This could be upgraded to syslog-style logging if required.
|
|
|
|
This is set up for simple IMAP SSL authentication using TLS with implied
|
|
STARTTLS. If manual STARTTLS is required, the MailBox method will need to be
|
|
altered to MailBoxTls. If Outlook or Gmail or similar are used, it will be
|
|
necessary to implement OAUTH2.
|
|
|
|
Authentication and other user configuration is configured in a .env file as
|
|
described below.
|
|
|
|
## How is it configured?
|
|
|
|
Here is an example .env file:
|
|
|
|
```
|
|
MBOX_USER = 'testmail'
|
|
MBOX_PASS = 'supersecretpassword'
|
|
MAIL_HOST = 'imap.some.server.com'
|
|
MAIL_PORT = 993
|
|
DTD = 'items.dtd'
|
|
```
|
|
|
|
Note that the DTD file should be provided as a name only but is expected to be
|
|
found in the ```xml``` subdirectory.
|
|
|
|
When run for the first time (or more precisely, when the first attachment is
|
|
downloaded) a saved_hashes.json file will be created in this script's
|
|
directory. This file should be backed up if the attachment history is important
|
|
as it is this file which prevents duplicate attachments being downloaded and
|
|
processed.
|
|
|
|
The script also expects to find a column_mapping.json file in it's directory
|
|
containing CSV columns to XML fields. Unused columns in a CSV can simply be
|
|
left out and they will be ignored. The title of each json object in this file
|
|
should match the domain of the sender of the email containing the CSV
|
|
attachment. The "default" object will be used if no specific match is found.
|
|
|
|
Here is an example ```column_mapping.json``` file:
|
|
|
|
```
|
|
{
|
|
"default": {
|
|
"Item_ID": "Item_ID",
|
|
"Item_Name": "Item_Name",
|
|
"Item_Description": "Item_Description",
|
|
"Item_Price": "Item_Price",
|
|
"Item_Quantity": "Item_Quantity"
|
|
},
|
|
"hamiltron.net": {
|
|
"Item_ID": "ID",
|
|
"Item_Name": "Name",
|
|
"Item_Description": "Description",
|
|
"Item_Price": "Price",
|
|
"Item_Quantity": "Quantity"
|
|
}
|
|
}
|
|
```
|
|
|
|
The script will also create the directories ```temp``` (used for temporary file
|
|
processing) and ```attachments``` where CSV files are stored. If it is not
|
|
important to keep attachments, the latter folder can be ignored, purged, or
|
|
even removed entirely - it will be recreated on next run to no ill effect.
|
|
|
|
Processed XML files ready for final consumption live in the ```xml``` directory
|
|
along with the DTD. This directory is also unimportant for backup purposes
|
|
depending on your requirements for xml files post-consumption. That said, The
|
|
DTD file __MUST__ exist there.
|
|
|
|
## Installation
|
|
|
|
Because this script uses a python virtual environment, and is designed to run
|
|
from ```cron``` it should be installed like so (it is assumed there is a
|
|
working python3 and pip3 on the system).
|
|
|
|
* Perform the following steps in your home directory (or some other location
|
|
where you have filesystem permissions):
|
|
```
|
|
git clone https://git.hamiltron.net/greig/email_to_xml.git
|
|
cd email_to_xml
|
|
pip3 install virtualenv
|
|
virtualenv email_to_xml
|
|
mkdir xml
|
|
touch .env
|
|
```
|
|
|
|
* Copy your verified DTD file into the xml directory.
|
|
* Edit .env and configure the correct values for your setup.
|
|
* Create your self-signed certificate and key as follows:
|
|
```openssl req -x509 -newkey rsa:4096 -keyout one.key -out one.crt -sha256 -days 3650 -nodes -subj "/C=XX/ST=StateName/L=CityName/O=CompanyName/OU=CompanySectionName/CN=CommonNameOrHostname"```
|
|
|
|
If you wish to run the script interactively from the commandline, you will
|
|
first need to activate the virtual environment:
|
|
```. ./email_to_xml/bin/activate```
|
|
You may now run ```./getmail.py``` - you can see the result of the run in
|
|
```./getmail.log```.
|
|
|
|
Here is an example crontab line to run the script every 5 minutes:
|
|
```*/5 * * * * cd ~/email_to_xml && ./email_to_xml/bin/python3 ~/email_to_xml/getmail.py```
|