email_to_xml ## Version History 2025-01-03: Initial release ## TODO * Implement OAUTH2 authentication for GMail / Office365 / etc. * Optionally support multiple CSV formats per client ## What is it? A relatively simple script designed to grab emailed CSV data from a mailbox, and convert it to validated clean XML for import into another system. ## How does it work? The ```getmail.py``` script is designed to poll an IMAP mailbox when run. It will find any emails with attachments, marking them as read, and saving the attachment to the 'attachments' directory for later processing. Attachments are output named with the current date-time, and a semi-random uid based on the file hash in order to prevent namespace collisions. Attachments are expected to be CSV format, but can be in various field orders, though a header row is required. No file locking is used, however files are written to a temporary directory first, flushed, and renamed upon completion, as renaming is an atomic operation at an OS level. Attachments will not be created if they have an identical hash to a previously downloaded attachment. This is designed to prevent scenarios where the same file has been accidentally sent multiple times. Note that this identification is done based on file content, and the name of the file is irrelevant. Once an attachment is saved, we look up the mapping of fields in the attachment for input, using a custom mapping based on the domain of the email address of the sender. Different companies may use differing CSV formats, but we work on the assumption that the XML will need to be the same and based on a well defined DTD. After mapping and conversion to XML, the file is validated against the DTD and written to the xml directory for consumption by the target software. Logging is fairly primitive and done to a log file in the same directory as the script. This could be upgraded to syslog-style logging if required. This is set up for simple IMAP SSL authentication using TLS with implied STARTTLS. If manual STARTTLS is required, the MailBox method will need to be altered to MailBoxTls. If Outlook or Gmail or similar are used, it will be necessary to implement OAUTH2. Authentication and other user configuration is configured in a .env file as described below. ## How is it configured? Here is an example .env file: ``` MBOX_USER = 'testmail' MBOX_PASS = 'supersecretpassword' MAIL_HOST = 'imap.some.server.com' MAIL_PORT = 993 DTD = 'items.dtd' ``` Note that the DTD file should be provided as a name only but is expected to be found in the ```xml``` subdirectory. When run for the first time (or more precisely, when the first attachment is downloaded) a saved_hashes.json file will be created in this script's directory. This file should be backed up if the attachment history is important as it is this file which prevents duplicate attachments being downloaded and processed. The script also expects to find a column_mapping.json file in it's directory containing CSV columns to XML fields. Unused columns in a CSV can simply be left out and they will be ignored. The title of each json object in this file should match the domain of the sender of the email containing the CSV attachment. The "default" object will be used if no specific match is found. Here is an example ```column_mapping.json``` file: ``` { "default": { "Item_ID": "Item_ID", "Item_Name": "Item_Name", "Item_Description": "Item_Description", "Item_Price": "Item_Price", "Item_Quantity": "Item_Quantity" }, "hamiltron.net": { "Item_ID": "ID", "Item_Name": "Name", "Item_Description": "Description", "Item_Price": "Price", "Item_Quantity": "Quantity" } } ``` The script will also create the directories ```temp``` (used for temporary file processing) and ```attachments``` where CSV files are stored. If it is not important to keep attachments, the latter folder can be ignored, purged, or even removed entirely - it will be recreated on next run to no ill effect. Processed XML files ready for final consumption live in the ```xml``` directory along with the DTD. This directory is also unimportant for backup purposes depending on your requirements for xml files post-consumption. That said, The DTD file __MUST__ exist there. ## Installation Because this script uses a python virtual environment, and is designed to run from ```cron``` it should be installed like so (it is assumed there is a working python3 and pip3 on the system). * Perform the following steps in your home directory (or some other location where you have filesystem permissions): ``` git clone https://git.hamiltron.net/greig/email_to_xml.git cd email_to_xml pip3 install virtualenv virtualenv email_to_xml mkdir xml touch .env ``` * Copy your verified DTD file into the xml directory. * Edit .env and configure the correct values for your setup. * Create your self-signed certificate and key as follows: ```openssl req -x509 -newkey rsa:4096 -keyout one.key -out one.crt -sha256 -days 3650 -nodes -subj "/C=XX/ST=StateName/L=CityName/O=CompanyName/OU=CompanySectionName/CN=CommonNameOrHostname"``` If you wish to run the script interactively from the commandline, you will first need to activate the virtual environment: ```. ./email_to_xml/bin/activate``` You may now run ```./getmail.py``` - you can see the result of the run in ```./getmail.log```. Here is an example crontab line to run the script every 5 minutes: ```*/5 * * * * cd ~/email_to_xml && ./email_to_xml/bin/python3 ~/email_to_xml/getmail.py```