Moved docs to README.md

2025-01-03 06:52:46 +13:00
parent 2441ca3c26
commit a365286b4c
2 changed files with 108 additions and 36 deletions

107
README.md Normal file

@@ -0,0 +1,107 @@
# email_to_xml
## What is it?
A relatively simple script designed to grab emailed CSV data from a mailbox,
and convert it to validated clean XML for import into another system.
## How does it work?
The ```getmail.py``` script polls an IMAP mailbox each time it is run.
It finds any emails with attachments, marks them as read, and saves each
attachment to the ```attachments``` directory for later processing.
Saved attachments are named with the current date-time and a semi-random
uid derived from the file hash, in order to prevent naming collisions.
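As a rough illustration of the naming scheme described above, the sketch below builds a name from a timestamp and a short hash prefix. The function name `attachment_filename`, the timestamp layout, the use of SHA-256, and the 8-character prefix are all assumptions made for this example, not details taken from the script.

```python
import hashlib
from datetime import datetime
from pathlib import Path
from typing import Optional


def attachment_filename(payload: bytes, original_name: str,
                        now: Optional[datetime] = None) -> str:
    """Return a name such as 20250103T065246_1a2b3c4d.csv (layout is assumed)."""
    now = now or datetime.now()
    # Hash the content, not the name, so the uid depends only on the data.
    digest = hashlib.sha256(payload).hexdigest()
    # Keep the sender's extension, falling back to .csv when there is none.
    suffix = Path(original_name).suffix or ".csv"
    return f"{now:%Y%m%dT%H%M%S}_{digest[:8]}{suffix}"
```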
Attachments are expected to be CSV format, but can be in various field orders,
though a header row is required.
No file locking is used. Instead, files are written to a temporary directory
first, flushed, and renamed upon completion, as renaming is an atomic operation
at an OS level.
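A minimal sketch of the write-then-rename pattern described above follows. The helper name `write_atomically` and its parameters are hypothetical. Note that a rename is only atomic when the temporary and final paths sit on the same filesystem, which this sketch assumes.

```python
import os
import tempfile
from pathlib import Path


def write_atomically(payload: bytes, temp_dir: Path, final_path: Path) -> Path:
    """Write payload to a temp file, flush it, then rename it into place."""
    temp_dir.mkdir(parents=True, exist_ok=True)
    final_path.parent.mkdir(parents=True, exist_ok=True)
    fd, temp_name = tempfile.mkstemp(dir=temp_dir)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())  # push the data to disk before renaming
        # os.replace swaps the file into place in a single step.
        os.replace(temp_name, final_path)
    except BaseException:
        # Do not leave a half-written temp file behind on failure.
        if os.path.exists(temp_name):
            os.remove(temp_name)
        raise
    return final_path
```

A reader of the target directory therefore sees either no file or the complete file, never a partial one.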
Attachments will not be created if they have an identical hash to a
previously downloaded attachment. This is designed to prevent scenarios where
the same file has been accidentally sent multiple times. Note that this
identification is done based on file content, and the name of the file is
irrelevant.
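The duplicate check could look something like the sketch below. It assumes the hash store is a JSON list of SHA-256 hex digests. The real layout of saved_hashes.json and the hash algorithm are not stated here, and `is_new_attachment` is a made-up name.

```python
import hashlib
import json
from pathlib import Path


def is_new_attachment(payload: bytes, hashes_file: Path) -> bool:
    """Return True (and record the hash) if this content is unseen."""
    if hashes_file.exists():
        seen = set(json.loads(hashes_file.read_text()))
    else:
        seen = set()  # first run: no history yet
    digest = hashlib.sha256(payload).hexdigest()
    if digest in seen:
        return False  # same bytes already downloaded, whatever the file name
    seen.add(digest)
    hashes_file.write_text(json.dumps(sorted(seen), indent=2))
    return True
```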
Once an attachment is saved, we look up the mapping of fields in the attachment
for input, using a custom mapping based on the domain of the email address of
the sender. Different companies may use differing CSV formats, but we work on
the assumption that the XML will need to be the same and based on a well
defined DTD.
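One way to pick a mapping by sender domain is sketched below. It assumes the mappings are already loaded into a dict keyed by domain with a "default" entry as a fallback. `mapping_for_sender` is a hypothetical helper, not a function from the script.

```python
from email.utils import parseaddr


def mapping_for_sender(sender: str, mappings: dict) -> dict:
    """Return the mapping for the sender's domain, or the default one."""
    # parseaddr copes with both "a@b.c" and "Name <a@b.c>" forms.
    _, address = parseaddr(sender)
    domain = address.rpartition("@")[2].lower()
    return mappings.get(domain, mappings["default"])
```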
After mapping and conversion to XML, the file is validated against the DTD and
written to the xml directory for consumption by the target software.
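The conversion step might be sketched as follows using only the standard library. This example assumes that mapping keys are XML element names and mapping values are CSV column headers; the direction is a guess. The `items` and `item` tag names are also invented for illustration. DTD validation is not shown, since the standard library's ElementTree does not validate against a DTD and a third-party library would be needed for that step.

```python
import csv
import io
import xml.etree.ElementTree as ET


def csv_to_xml(csv_text: str, mapping: dict,
               root_tag: str = "items", row_tag: str = "item") -> str:
    """Convert CSV text to an XML string, one element per CSV row."""
    root = ET.Element(root_tag)
    # DictReader keys each row by the header, so column order does not matter.
    for row in csv.DictReader(io.StringIO(csv_text)):
        item = ET.SubElement(root, row_tag)
        for xml_field, csv_column in mapping.items():
            ET.SubElement(item, xml_field).text = row[csv_column]
    return ET.tostring(root, encoding="unicode")
```

Columns that do not appear in the mapping are simply never read, which matches the behaviour of ignoring unused columns.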
Logging is fairly primitive and done to a log file in the same directory as
the script. This could be upgraded to syslog-style logging if required.
This is set up for simple IMAP authentication over implicit TLS (the
connection is encrypted from the start, typically on port 993, with no
STARTTLS step). If explicit STARTTLS is required, the MailBox call will need
to be changed to MailBoxTls. If Outlook, Gmail, or similar services are used,
it will be necessary to implement OAUTH2.
Authentication and other user configuration are set in a .env file as
described below.
## How is it configured?
Here is an example .env file:
```
MBOX_USER = 'testmail'
MBOX_PASS = 'supersecretpassword'
MAIL_HOST = 'imap.some.server.com'
MAIL_PORT = 993
DTD = 'items.dtd'
```
Note that the DTD file should be provided as a name only but is expected to be
found in the ```xml``` subdirectory.
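To show how these settings fit together, here is a small standard-library stand-in for reading a file in the format above and resolving the DTD path. How the script actually loads its .env file is not shown here (a dotenv library is a likely candidate), so `parse_env` and `dtd_path` are purely illustrative names.

```python
from pathlib import Path


def parse_env(text: str) -> dict:
    """Parse KEY = 'value' lines into a dict of strings."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip().strip("'\"")
    return settings


def dtd_path(settings: dict, script_dir: Path) -> Path:
    """The DTD setting is a bare file name inside the xml subdirectory."""
    return script_dir / "xml" / settings["DTD"]
```

All values come back as strings in this sketch, so a port such as 993 would need converting with `int()` before use.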
When run for the first time (or more precisely, when the first attachment is
downloaded) a saved_hashes.json file will be created in this script's
directory. This file should be backed up if the attachment history is important
as it is this file which prevents duplicate attachments being downloaded and
processed.
The script also expects to find a column_mapping.json file in its directory,
mapping CSV columns to XML fields. Unused columns in a CSV can simply be
left out and they will be ignored. The name of each JSON object in this file
should match the domain of the sender of the email containing the CSV
attachment. The "default" object will be used if no specific match is found.
Here is an example ```column_mapping.json``` file:
```
{
"default": {
"Item_ID": "Item_ID",
"Item_Name": "Item_Name",
"Item_Description": "Item_Description",
"Item_Price": "Item_Price",
"Item_Quantity": "Item_Quantity"
},
"hamiltron.net": {
"Item_ID": "ID",
"Item_Name": "Name",
"Item_Description": "Description",
"Item_Price": "Price",
"Item_Quantity": "Quantity"
}
}
```
The script will also create the directories ```temp``` (used for temporary file
processing) and ```attachments``` where CSV files are stored. If it is not
important to keep attachments, the latter folder can be ignored, purged, or
even removed entirely - it will be recreated on next run to no ill effect.
Processed XML files ready for final consumption live in the ```xml``` directory
along with the DTD. This directory is also unimportant for backup purposes,
depending on your requirements for XML files post-consumption. That said, the
DTD file __MUST__ exist there.


@@ -3,42 +3,7 @@
 """
 Developed by Greig McGill of Sense7.
-This script is designed to poll an IMAP mailbox when run.
-It will find any emails with attachments, marking them as read, and saving
-the attachment to the 'attachments' directory for later processing.
-It will respect the attachment filetype and extension.
-Attachments are output named with the current date-time, and a semi-random
-uid based on the file hash in order to prevent namespace collisions.
-No file locking is used, however files are written to a temporary directory
-first, flushed, and renamed upon completion, as renaming is an atomic operation
-at an OS level.
-Attachments will not be created if they have an identical hash to a
-previously downloaded attachment. This is designed to prevent scenarios where
-the same file has been accidentally sent multiple times. Note that this
-identification is done based on file content, and the name of the file is
-irrelevant.
-Once an attachment is saved, we look up the mapping of fields in the attachment
-for input, using a custom mapping based on the domain of the email address of
-the sender. Different companies may use differing CSV formats, but we work on
-the assumption tha the XML for input into Logistic will need to be the same and
-based on a well defined DTD.
-After mapping and conversion to XML, the file is validated against the DTD and
-written to the xml directory for injection into logistic.
-Logging is fairly primitive and done to a log file in the same directory as
-the script. This could be upgraded to syslog-style logging if required.
-This is set up for simple IMAP SSL authentication using TLS with implied
-STARTTLS. If manual STARTTLS is required, the MailBox method will need to be
-altered to MailBoxTls. If Outlook or Gmail or similar are used, it will be
-necessary to implement OAUTH2.
-Authentication and other user configuration is configured in a .env file as
-described below in the code.
+See README.md for full documentation and version history.
 """
 # Standard libraries