From a365286b4ced30baa6e8e5008646ac499e1ccade Mon Sep 17 00:00:00 2001
From: Greig McGill
Date: Fri, 3 Jan 2025 06:52:46 +1300
Subject: [PATCH] Moved docs to README.md

---
 README.md  | 107 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 getmail.py |  37 +-----------------
 2 files changed, 108 insertions(+), 36 deletions(-)
 create mode 100644 README.md

diff --git a/README.md b/README.md
new file mode 100644
index 0000000..9b44214
--- /dev/null
+++ b/README.md
@@ -0,0 +1,107 @@
+email_to_xml
+
+## What is it?
+
+A relatively simple script designed to grab emailed CSV data from a mailbox,
+and convert it to validated clean XML for import into another system.
+
+## How does it work?
+
+The ```getmail.py``` script is designed to poll an IMAP mailbox when run.
+
+It will find any emails with attachments, marking them as read, and saving
+the attachment to the 'attachments' directory for later processing.
+
+Attachments are output named with the current date-time, and a semi-random
+uid based on the file hash in order to prevent namespace collisions.
+
+Attachments are expected to be CSV format, but can be in various field orders,
+though a header row is required.
+
+No file locking is used, however files are written to a temporary directory
+first, flushed, and renamed upon completion, as renaming is an atomic operation
+at an OS level.
+
+Attachments will not be created if they have an identical hash to a
+previously downloaded attachment. This is designed to prevent scenarios where
+the same file has been accidentally sent multiple times. Note that this
+identification is done based on file content, and the name of the file is
+irrelevant.
+
+Once an attachment is saved, we look up the mapping of fields in the attachment
+for input, using a custom mapping based on the domain of the email address of
+the sender. Different companies may use differing CSV formats, but we work on
+the assumption that the XML will need to be the same and based on a well
+defined DTD.
+
+After mapping and conversion to XML, the file is validated against the DTD and
+written to the xml directory for consumption by the target software.
+
+Logging is fairly primitive and done to a log file in the same directory as
+the script. This could be upgraded to syslog-style logging if required.
+
+This is set up for simple IMAP SSL authentication using TLS with implied
+STARTTLS. If manual STARTTLS is required, the MailBox method will need to be
+altered to MailBoxTls. If Outlook or Gmail or similar are used, it will be
+necessary to implement OAUTH2.
+
+Authentication and other user configuration is configured in a .env file as
+described below.
+
+## How is it configured?
+
+Here is an example .env file:
+
+```
+MBOX_USER = 'testmail'
+MBOX_PASS = 'supersecretpassword'
+MAIL_HOST = 'imap.some.server.com'
+MAIL_PORT = 993
+DTD = 'items.dtd'
+```
+
+Note that the DTD file should be provided as a name only but is expected to be
+found in the ```xml``` subdirectory.
+
+When run for the first time (or more precisely, when the first attachment is
+downloaded) a saved_hashes.json file will be created in this script's
+directory. This file should be backed up if the attachment history is important
+as it is this file which prevents duplicate attachments being downloaded and
+processed.
+
+The script also expects to find a column_mapping.json file in its directory
+mapping CSV columns to XML fields. Unused columns in a CSV can simply be
+left out and they will be ignored. The title of each json object in this file
+should match the domain of the sender of the email containing the CSV
+attachment. The "default" object will be used if no specific match is found.
+
+Here is an example ```column_mapping.json``` file:
+
+```
+{
+  "default": {
+    "Item_ID": "Item_ID",
+    "Item_Name": "Item_Name",
+    "Item_Description": "Item_Description",
+    "Item_Price": "Item_Price",
+    "Item_Quantity": "Item_Quantity"
+  },
+  "hamiltron.net": {
+    "Item_ID": "ID",
+    "Item_Name": "Name",
+    "Item_Description": "Description",
+    "Item_Price": "Price",
+    "Item_Quantity": "Quantity"
+  }
+}
+```
+
+The script will also create the directories ```temp``` (used for temporary file
+processing) and ```attachments``` where CSV files are stored. If it is not
+important to keep attachments, the latter folder can be ignored, purged, or
+even removed entirely - it will be recreated on next run to no ill effect.
+
+Processed XML files ready for final consumption live in the ```xml``` directory
+along with the DTD. This directory is also unimportant for backup purposes
+depending on your requirements for xml files post-consumption. That said, the
+DTD file __MUST__ exist there.
diff --git a/getmail.py b/getmail.py
index 5653fc3..65998fa 100755
--- a/getmail.py
+++ b/getmail.py
@@ -3,42 +3,7 @@
 """
 Developed by Greig McGill of Sense7.
 
-This script is designed to poll an IMAP mailbox when run.
-It will find any emails with attachments, marking them as read, and saving
-the attachment to the 'attachments' directory for later processing.
-It will respect the attachment filetype and extension.
-Attachments are output named with the current date-time, and a semi-random
-uid based on the file hash in order to prevent namespace collisions.
-
-No file locking is used, however files are written to a temporary directory
-first, flushed, and renamed upon completion, as renaming is an atomic operation
-at an OS level.
-
-Attachments will not be created if they have an identical hash to a
-previously downloaded attachment. This is designed to prevent scenarios where
-the same file has been accidentally sent multiple times. Note that this
-identification is done based on file content, and the name of the file is
-irrelevant.
-
-Once an attachment is saved, we look up the mapping of fields in the attachment
-for input, using a custom mapping based on the domain of the email address of
-the sender. Different companies may use differing CSV formats, but we work on
-the assumption tha the XML for input into Logistic will need to be the same and
-based on a well defined DTD.
-
-After mapping and conversion to XML, the file is validated against the DTD and
-written to the xml directory for injection into logistic.
-
-Logging is fairly primitive and done to a log file in the same directory as
-the script. This could be upgraded to syslog-style logging if required.
-
-This is set up for simple IMAP SSL authentication using TLS with implied
-STARTTLS. If manual STARTTLS is required, the MailBox method will need to be
-altered to MailBoxTls. If Outlook or Gmail or similar are used, it will be
-necessary to implement OAUTH2.
-
-Authentication and other user configuration is configured in a .env file as
-described below in the code.
+See README.md for full documentation and version history.
 """
 
 # Standard libraries
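
The README added by this patch describes choosing a column mapping from the sender's domain and falling back to the "default" object. The sketch below illustrates that lookup and the CSV-to-XML step; it is not the code in getmail.py. The helper names `mapping_for_sender` and `csv_to_xml` and the `Items`/`Item` element names are hypothetical, and it assumes each mapping key is an XML field and each value is the sender's CSV column header, which the example file suggests but the README does not state outright. DTD validation is left out.

```python
import csv
import io
import xml.etree.ElementTree as ET
from email.utils import parseaddr

# Same shape as the README's example column_mapping.json (trimmed to two fields).
COLUMN_MAPPING = {
    "default": {"Item_ID": "Item_ID", "Item_Name": "Item_Name"},
    "hamiltron.net": {"Item_ID": "ID", "Item_Name": "Name"},
}


def mapping_for_sender(sender, mappings):
    """Return the mapping for the sender's domain, else the "default" one."""
    _, address = parseaddr(sender)
    domain = address.rpartition("@")[2].lower()
    return mappings.get(domain, mappings["default"])


def csv_to_xml(csv_text, mapping):
    """Build an XML tree from CSV text (root/row element names are made up)."""
    root = ET.Element("Items")
    # DictReader keys each row by the header row, so column order is irrelevant.
    for row in csv.DictReader(io.StringIO(csv_text)):
        item = ET.SubElement(root, "Item")
        # Assumption: key = XML field, value = CSV column header.
        for xml_field, csv_column in mapping.items():
            ET.SubElement(item, xml_field).text = row.get(csv_column, "")
    return root
```

Because only the columns named in the mapping are read, any extra CSV columns are ignored, which matches the README's note about unused columns.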