Moved docs to README.md
107
README.md
Normal file
@@ -0,0 +1,107 @@
email_to_xml

## What is it?

A relatively simple script that fetches emailed CSV data from a mailbox and
converts it to clean, validated XML for import into another system.

## How does it work?

The ```getmail.py``` script polls an IMAP mailbox each time it is run.

It finds any emails with attachments, marks them as read, and saves each
attachment to the ```attachments``` directory for later processing.

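As a rough illustration of the attachment step, the sketch below pulls attachments out of one raw message using only the Python standard library. It is not the script's actual code: the function name `extract_attachments` is hypothetical, and the IMAP side (fetching the message and marking it as read) is deliberately left out.

```python
from email import message_from_bytes, policy
from typing import List, Tuple


def extract_attachments(raw_message: bytes) -> List[Tuple[str, bytes]]:
    """Return (filename, payload) pairs for each attachment in one message.

    Hypothetical helper: assumes the raw RFC 5322 bytes of a single email
    have already been fetched from the mailbox.
    """
    msg = message_from_bytes(raw_message, policy=policy.default)
    found = []
    for part in msg.iter_attachments():
        # decode=True undoes the base64 / quoted-printable transfer encoding
        payload = part.get_payload(decode=True)
        if payload is not None:
            found.append((part.get_filename() or "unnamed", payload))
    return found
```

A message with no attachments simply yields an empty list, so the caller can skip it.
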
Saved attachments are named with the current date-time plus a semi-random uid
based on the file hash, in order to prevent filename collisions.

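One way such a name could be built is sketched below. The exact format is an assumption: the timestamp layout, the use of SHA-256, the 8-character uid length, and the function name `attachment_name` are all illustrative choices, not taken from the script.

```python
import hashlib
from datetime import datetime
from pathlib import Path
from typing import Optional


def attachment_name(content: bytes, original_name: str,
                    now: Optional[datetime] = None) -> str:
    """Build an output filename from a timestamp and a hash-derived uid."""
    now = now or datetime.now()
    # Assumed scheme: the first 8 hex characters of the SHA-256 digest
    uid = hashlib.sha256(content).hexdigest()[:8]
    # Keep the sender's extension, falling back to .csv (an assumption)
    suffix = Path(original_name).suffix or ".csv"
    return f"{now:%Y%m%d-%H%M%S}_{uid}{suffix}"
```

Because the uid comes from the content, two different files saved in the same second still get different names.
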
Attachments are expected to be in CSV format. Columns may appear in any order,
but a header row is required.

No file locking is used. Instead, files are written to a temporary directory
first, flushed, and renamed upon completion, as renaming is an atomic operation
at the OS level.

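The write-then-rename pattern can be sketched as follows. This is a minimal illustration rather than the script's own code: `write_atomically` is a hypothetical name, and it assumes the temporary and final directories sit on the same filesystem, which is what makes the rename atomic.

```python
import os
import tempfile
from pathlib import Path


def write_atomically(data: bytes, final_path: Path, temp_dir: Path) -> None:
    """Write data to a temp file, flush it to disk, then rename into place."""
    temp_dir.mkdir(parents=True, exist_ok=True)
    final_path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=temp_dir)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        # Readers see either no file or the complete file, never a partial one
        os.replace(tmp_name, final_path)
    except BaseException:
        # Do not leave a half-written temp file behind on failure
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise
```
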
Attachments will not be created if they have an identical hash to a
previously downloaded attachment. This is designed to prevent scenarios where
the same file has been accidentally sent multiple times. Note that this
identification is done based on file content, and the name of the file is
irrelevant.

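A content-based duplicate check along these lines might look like the sketch below. The on-disk layout is an assumption: the store is treated as a JSON list of SHA-256 hex digests, and `is_new_attachment` is a hypothetical name.

```python
import hashlib
import json
from pathlib import Path


def is_new_attachment(content: bytes, store: Path) -> bool:
    """Return True and record the hash if this content has not been seen."""
    digest = hashlib.sha256(content).hexdigest()
    # Assumed format: a JSON list of hex digests; created on first use
    seen = set(json.loads(store.read_text())) if store.exists() else set()
    if digest in seen:
        return False
    seen.add(digest)
    store.write_text(json.dumps(sorted(seen)))
    return True
```

Only the bytes are hashed, so a resend of the same file under a new name is still detected.
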
Once an attachment is saved, its fields are mapped using a custom mapping
selected by the domain of the sender's email address. Different companies may
use differing CSV formats, but we work on the assumption that the XML will need
to be the same and based on a well-defined DTD.

After mapping and conversion to XML, the file is validated against the DTD and
written to the xml directory for consumption by the target software.

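The conversion step could be sketched as below using the standard library. The element names `Items` and `Item` are assumptions, since the real structure is dictated by the DTD, and `rows_to_xml` is a hypothetical helper. DTD validation itself is not shown: the standard library does not validate against a DTD, so a validating parser (for example lxml's `etree.DTD`) would be needed for that step.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def rows_to_xml(rows: List[Dict[str, str]]) -> bytes:
    """Serialise already-mapped rows as one XML document."""
    root = ET.Element("Items")  # assumed root element name
    for row in rows:
        item = ET.SubElement(root, "Item")  # assumed per-row element name
        for field, value in row.items():
            # Each mapped field becomes a child element holding the CSV value
            ET.SubElement(item, field).text = value
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)
```
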
Logging is fairly primitive and done to a log file in the same directory as
the script. This could be upgraded to syslog-style logging if required.

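For reference, file logging of this kind can be set up with the standard `logging` module roughly as follows. The logger name, the file name `getmail.log`, and the message format are assumptions for illustration.

```python
import logging
from pathlib import Path


def make_logger(script_dir: Path) -> logging.Logger:
    """Return a logger that appends to a file next to the script."""
    logger = logging.getLogger("email_to_xml")
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid adding duplicate handlers on repeat calls
        handler = logging.FileHandler(script_dir / "getmail.log")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

Moving to syslog would mostly mean replacing the `FileHandler` with `logging.handlers.SysLogHandler`.
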
This is set up for simple IMAP authentication over implicit TLS (IMAP over SSL,
as opposed to STARTTLS). If explicit STARTTLS is required, ```MailBox``` will
need to be changed to ```MailBoxTls```. If Outlook, Gmail, or similar are used,
it will be necessary to implement OAuth2.

Authentication and other user settings are configured in a .env file as
described below.

## How is it configured?

Here is an example .env file:

```
MBOX_USER = 'testmail'
MBOX_PASS = 'supersecretpassword'
MAIL_HOST = 'imap.some.server.com'
MAIL_PORT = 993
DTD = 'items.dtd'
```

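To show how values in this format can be read, here is a minimal standard-library sketch. It is not necessarily how the script loads its settings (a dedicated .env loader library may well be used instead), and `load_env` is a hypothetical name.

```python
from pathlib import Path
from typing import Dict


def load_env(path: Path) -> Dict[str, str]:
    """Parse simple KEY = 'value' lines into a dict of strings."""
    settings = {}
    for line in path.read_text().splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not an assignment
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional single or double quotes
        settings[key.strip()] = value.strip().strip("'\"")
    return settings
```

All values come back as strings, so a numeric setting such as `MAIL_PORT` needs an explicit `int(...)` before use.
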
Note that the DTD should be given as a file name only; the file itself is
expected to be found in the ```xml``` subdirectory.

When run for the first time (or more precisely, when the first attachment is
downloaded), a saved_hashes.json file will be created in this script's
directory. This file should be backed up if the attachment history is
important, as it is this file which prevents duplicate attachments being
downloaded and processed.

The script also expects to find a column_mapping.json file in its directory,
mapping CSV columns to XML fields. Unused columns in a CSV can simply be
left out and they will be ignored. The key of each JSON object in this file
should match the domain of the sender of the email containing the CSV
attachment. The "default" object will be used if no specific match is found.

Here is an example ```column_mapping.json``` file:

```
{
    "default": {
        "Item_ID": "Item_ID",
        "Item_Name": "Item_Name",
        "Item_Description": "Item_Description",
        "Item_Price": "Item_Price",
        "Item_Quantity": "Item_Quantity"
    },
    "hamiltron.net": {
        "Item_ID": "ID",
        "Item_Name": "Name",
        "Item_Description": "Description",
        "Item_Price": "Price",
        "Item_Quantity": "Quantity"
    }
}
```

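The lookup and remapping might work along the lines of the sketch below. It assumes each key in a mapping is the XML field name and each value is the matching CSV column header, which is one reading of the example above, and that the sender is a bare address such as `orders@hamiltron.net`. The function names are hypothetical.

```python
import csv
import io
from typing import Dict, List


def mapping_for(sender: str, mappings: Dict[str, Dict[str, str]]) -> Dict[str, str]:
    """Pick the mapping for the sender's domain, else the "default" one."""
    domain = sender.rsplit("@", 1)[-1].lower()
    return mappings.get(domain, mappings["default"])


def remap_rows(csv_text: str, mapping: Dict[str, str]) -> List[Dict[str, str]]:
    """Rename CSV columns to XML field names, dropping unmapped columns."""
    # DictReader uses the header row, so the column order does not matter
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {xml_field: row[column] for xml_field, column in mapping.items()}
        for row in reader
    ]
```

Columns that do not appear in the mapping never reach the output, which matches the note that unused columns are ignored.
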
The script will also create the directories ```temp``` (used for temporary file
processing) and ```attachments``` where CSV files are stored. If it is not
important to keep attachments, the latter folder can be ignored, purged, or
even removed entirely - it will be recreated on next run to no ill effect.

Processed XML files ready for final consumption live in the ```xml``` directory
along with the DTD. This directory is also unimportant for backup purposes
depending on your requirements for XML files post-consumption. That said, the
DTD file __MUST__ exist there.

37
getmail.py
@@ -3,42 +3,7 @@
 """
 Developed by Greig McGill of Sense7.
 
-This script is designed to poll an IMAP mailbox when run.
-It will find any emails with attachments, marking them as read, and saving
-the attachment to the 'attachments' directory for later processing.
-It will respect the attachment filetype and extension.
-Attachments are output named with the current date-time, and a semi-random
-uid based on the file hash in order to prevent namespace collisions.
-
-No file locking is used, however files are written to a temporary directory
-first, flushed, and renamed upon completion, as renaming is an atomic operation
-at an OS level.
-
-Attachments will not be created if they have an identical hash to a
-previously downloaded attachment. This is designed to prevent scenarios where
-the same file has been accidentally sent multiple times. Note that this
-identification is done based on file content, and the name of the file is
-irrelevant.
-
-Once an attachment is saved, we look up the mapping of fields in the attachment
-for input, using a custom mapping based on the domain of the email address of
-the sender. Different companies may use differing CSV formats, but we work on
-the assumption tha the XML for input into Logistic will need to be the same and
-based on a well defined DTD.
-
-After mapping and conversion to XML, the file is validated against the DTD and
-written to the xml directory for injection into logistic.
-
-Logging is fairly primitive and done to a log file in the same directory as
-the script. This could be upgraded to syslog-style logging if required.
-
-This is set up for simple IMAP SSL authentication using TLS with implied
-STARTTLS. If manual STARTTLS is required, the MailBox method will need to be
-altered to MailBoxTls. If Outlook or Gmail or similar are used, it will be
-necessary to implement OAUTH2.
-
-Authentication and other user configuration is configured in a .env file as
-described below in the code.
+See README.md for full documentation and version history.
 """
 
 # Standard libraries