ETL in the Cloud
What is ETL?
ETL (Extract, Transform, Load) is a data-warehousing term for extracting, transforming, and finally loading data.
In ABBYY Timeline, ETL is a feature that lets a user upload small or large files, single or multiple, zipped or unzipped. Once a file is uploaded, the user can work with it as if browsing a database table and run operations on it, for instance joining two columns into one or trimming values. After transforming the data, the user can load the transformed table into one or more projects. ETL is intended for advanced and big-data uploads. It is advanced because, unlike the regular file upload, it does not detect types automatically and does not create timelines from the data, so the user needs to understand the data. The user can upload compressed or large data files and then work on the raw data before creating timelines within a project.
To open a repository, click View > Repository.
A repository is an abstract container, similar to a project. A repository holds tables, and each table represents one or more data files that the user uploaded. A user can have more than one repository, and each repository can contain multiple tables. The user can switch between repositories. Like a project, a repository can have several users: the repository owner can invite users to collaborate on it.
A repository table is an actual database table. In the simplest case, a user uploads a CSV file into a repository, and ABBYY Timeline creates a database table with the contents of the CSV file; this becomes a repository table.
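As a rough illustration of what "a CSV file becomes a database table" means, the sketch below loads CSV text into an in-memory SQLite table. SQLite, the function name csv_to_table, and the sample data are assumptions made for this example; the storage engine ABBYY Timeline actually uses is not described here. Every column is stored as text, in line with the note above that ETL does not detect types automatically.

```python
import csv
import io
import sqlite3


def csv_to_table(conn, table_name, csv_text):
    """Create a table from CSV text; every column is stored as TEXT.

    Hypothetical helper for illustration only.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)  # first CSV line holds the column names
    # Quote identifiers so names containing spaces still work.
    columns = ", ".join('"{}" TEXT'.format(name.replace('"', '""')) for name in header)
    quoted_table = '"{}"'.format(table_name.replace('"', '""'))
    conn.execute("CREATE TABLE {} ({})".format(quoted_table, columns))
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(
        "INSERT INTO {} VALUES ({})".format(quoted_table, placeholders),
        reader,  # remaining CSV lines become table rows
    )
    return len(header)


conn = sqlite3.connect(":memory:")
csv_to_table(conn, "orders", "id,status\n1,open\n2,closed\n")
rows = conn.execute('SELECT id, status FROM "orders"').fetchall()
```

After this runs, the uploaded content can be queried like any other table, which is the behavior the paragraph above describes.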
To switch to the table view, select View > Repository > Tables. Then select the table you want. You can add different operations for each table by clicking the Add operation button and selecting an operation from the list.
To configure the operation, specify the necessary parameters. When you finish setting up the operation, you can enable a preview of it.
When you choose to see a preview, ABBYY Timeline runs the operation on a temporary data table. The original data table stays unchanged; the temporary table you see is only 1000 rows long. The preview is applied to this temporary table, which is reset after every operation preview.
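The preview behavior can be pictured with a small sketch: the operation runs on a fresh copy of the first 1000 rows, and the original table is never modified. The names preview and upper_name and the list-of-dicts table are hypothetical; only the 1000-row limit comes from the text.

```python
PREVIEW_ROWS = 1000  # preview size stated in the text


def preview(table, operation, limit=PREVIEW_ROWS):
    """Apply operation to a copy of the first `limit` rows of table.

    A new temporary copy is built on every call, so each preview starts
    from the original data and the original table stays untouched.
    """
    temp_table = [dict(row) for row in table[:limit]]
    return [operation(row) for row in temp_table]


def upper_name(row):
    """Example operation: convert the 'name' field to upper case."""
    row["name"] = row["name"].upper()
    return row


table = [{"name": "row{}".format(i)} for i in range(2500)]
result = preview(table, upper_name)
```

Here result holds 1000 transformed rows, while table still holds all 2500 original rows.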
- Change case – Converts the field value to UPPER or lower case. For strings only.
- Change type – Converts the selected field into the specified type.
- Combine timestamp – Creates a timestamp field from separate Date and Time fields.
- Create timestamp – Creates a timestamp field from a text field using a format expression.
- Date add – Adds a date part to a date or subtracts it from a date.
- Date diff – Creates a new field by calculating the difference between two dates.
- Delete – Deletes records that match a criterion, or all records.
- Delete by timelines – Deletes timelines that match a criterion.
- Delete column – Deletes the selected columns.
- Delete duplicates – Deletes duplicated records.
- Derive field – Creates a new field by combining several fields and fixed text.
- Extract substring – Extracts a substring from the field value. For strings only.
- Join – Adds a field from another (child) table to the parent table.
- Load into project – Loads the timelines into a new or existing project.
- Remove substring – Removes the specified substring from the field value. For strings only.
- Replace substring – Replaces the specified substring with another substring. For strings only.
- Round timestamp – Rounds the timestamp field to the specified unit (second, minute, etc.).
- Transpose – Converts a single record with multiple selected fields into multiple records.
- Trim – Removes extra spaces on the left and right.
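To make a few of the operations above concrete, here is a minimal sketch of Trim, Change case, Date diff, and Transpose on rows represented as Python dicts. The function names, parameters, and the choice of seconds as the Date diff unit are assumptions for illustration; they are not ABBYY Timeline's actual implementation or options.

```python
from datetime import datetime


def trim(value):
    """Trim: remove spaces on the left and right of a string."""
    return value.strip()


def change_case(value, to):
    """Change case: convert a string to upper or lower case."""
    return value.upper() if to == "upper" else value.lower()


def date_diff(row, start_field, end_field, new_field):
    """Date diff: add a field with end minus start, in seconds (assumed unit)."""
    out = dict(row)
    out[new_field] = (row[end_field] - row[start_field]).total_seconds()
    return out


def transpose(row, fields, name_field, value_field):
    """Transpose: turn one record with several selected fields into several records."""
    kept = {key: value for key, value in row.items() if key not in fields}
    return [dict(kept, **{name_field: field, value_field: row[field]}) for field in fields]


with_diff = date_diff(
    {"start": datetime(2021, 3, 1, 8, 0), "end": datetime(2021, 3, 1, 9, 30)},
    "start", "end", "seconds",
)
records = transpose(
    {"id": "7", "created": "2021-03-01", "closed": "2021-03-05"},
    ["created", "closed"], "event", "date",
)
```

In the Transpose call, the single record with the created and closed fields becomes two records, one per selected field, each keeping the id value.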
As part of the T (transform) in ETL, the user can perform various transformations on the data table. A common need is to concatenate two columns into a single column, for example a column that holds a Date without a Time and a column that holds a Time. The resulting column has the DateTime field type, which ABBYY Timeline requires.
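The Date plus Time example can be sketched as follows. The function name combine_timestamp, the field names, and the two format strings are hypothetical; real data may use other formats.

```python
from datetime import datetime


def combine_timestamp(row, date_field, time_field, new_field,
                      date_format="%Y-%m-%d", time_format="%H:%M:%S"):
    """Build one datetime value from separate date and time text fields."""
    date_part = datetime.strptime(row[date_field], date_format).date()
    time_part = datetime.strptime(row[time_field], time_format).time()
    out = dict(row)  # leave the input row unchanged
    out[new_field] = datetime.combine(date_part, time_part)
    return out


row = {"Date": "2021-03-01", "Time": "14:30:00"}
combined = combine_timestamp(row, "Date", "Time", "Timestamp")
```

The new Timestamp field holds a single date-and-time value, while the original Date and Time fields remain available until they are deleted.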
Once you have applied some operations to the data table, you can save the sequence of those operations as a to-do list and re-use it later on the same file or on other files that require the same operations.
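One way to think about a to-do list is as a saved, ordered list of operation descriptions that can be replayed on any table with the same columns. The sketch below stores such a list as JSON; the step format, the OPERATIONS registry, and run_todo_list are assumptions for illustration and do not reflect how ABBYY Timeline stores to-do lists.

```python
import json

# Hypothetical registry: operation name -> function(value, params).
OPERATIONS = {
    "trim": lambda value, params: value.strip(),
    "change_case": lambda value, params: (
        value.upper() if params["to"] == "upper" else value.lower()
    ),
    "replace_substring": lambda value, params: value.replace(params["old"], params["new"]),
}


def run_todo_list(rows, steps):
    """Apply each saved step, in order, to a copy of the rows."""
    result = [dict(row) for row in rows]
    for step in steps:
        operation = OPERATIONS[step["operation"]]
        for row in result:
            row[step["field"]] = operation(row[step["field"]], step.get("params", {}))
    return result


todo_list = [
    {"operation": "trim", "field": "city"},
    {"operation": "change_case", "field": "city", "params": {"to": "upper"}},
]
saved = json.dumps(todo_list)   # save the sequence for later
steps = json.loads(saved)       # load it again for another file
january = run_todo_list([{"city": "  Oslo "}], steps)
february = run_todo_list([{"city": "bergen  "}], steps)
```

The same saved steps clean both inputs, which is the point of re-using a to-do list on files that need identical operations.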
You may upload:
- A single CSV file
- Multiple CSV files in one zip archive (gzip support is planned for a later release)
- A single zipped CSV file
A repository can have any number of users, and a user can belong to multiple repositories.
A user's rights in a repository are one of the following:
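The three upload shapes listed above can be handled by one small routine, sketched here with Python's standard zipfile and csv modules. The function read_upload, the UTF-8 encoding, and the one-result-per-CSV-file layout are assumptions; how ABBYY Timeline maps uploaded files to repository tables is not shown here.

```python
import csv
import io
import zipfile


def read_upload(filename, data):
    """Return {csv_name: rows} for a plain CSV upload or a zip of CSV files."""
    if zipfile.is_zipfile(io.BytesIO(data)):
        tables = {}
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            for name in archive.namelist():
                if name.lower().endswith(".csv"):  # skip non-CSV entries
                    text = archive.read(name).decode("utf-8")
                    tables[name] = list(csv.reader(io.StringIO(text)))
        return tables
    # Not a zip archive: treat the upload as a single CSV file.
    return {filename: list(csv.reader(io.StringIO(data.decode("utf-8"))))}


# Build a small zip in memory to stand in for an uploaded archive.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("a.csv", "id,name\n1,Ann\n")
    archive.writestr("b.csv", "id,name\n2,Bob\n")

tables = read_upload("upload.zip", buffer.getvalue())
single = read_upload("c.csv", b"id,name\n3,Cy\n")
```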
- Data manager – can view the data AND upload new data
- Admin – can view the data AND upload new data AND add/remove other users and change their rights (except for Owner)
- Owner – the person who created the repository. A repository has only one owner at a time. Ownership can be transferred to another user. An owner has all the rights of an Admin. If the current owner changes their own role to any other, they are prompted to specify the user who will take over ownership of the repository.
Any admin or the owner can make another user an admin or remove admin rights from them; the Owner's rights cannot be changed this way. An admin changes the rights of other users through the drop-down lists in the user grid.
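The role rules above can be summarized in a short sketch. The right names ("view", "upload", "manage_users"), the functions, and the default role given to the previous owner after a transfer are all assumptions made for illustration.

```python
# Rights per role, as described in the list above (right names are hypothetical).
ROLE_RIGHTS = {
    "Data manager": {"view", "upload"},
    "Admin": {"view", "upload", "manage_users"},
    "Owner": {"view", "upload", "manage_users"},
}


def change_role(members, actor, target, new_role):
    """Let an Admin or Owner change another user's role, except the Owner's."""
    if "manage_users" not in ROLE_RIGHTS[members[actor]]:
        raise PermissionError("only an Admin or the Owner can change rights")
    if members[target] == "Owner" or new_role == "Owner":
        raise PermissionError("ownership changes go through transfer_ownership")
    members[target] = new_role


def transfer_ownership(members, owner, new_owner, old_owner_role="Admin"):
    """Hand ownership to new_owner so there is always exactly one Owner."""
    if members[owner] != "Owner":
        raise PermissionError("only the current Owner can transfer ownership")
    members[new_owner] = "Owner"
    members[owner] = old_owner_role  # assumed default for the previous owner


members = {"ann": "Owner", "bob": "Admin", "cy": "Data manager"}
change_role(members, "bob", "cy", "Admin")   # an Admin promotes a Data manager
try:
    change_role(members, "bob", "ann", "Admin")  # the Owner cannot be changed
    owner_change_blocked = False
except PermissionError:
    owner_change_blocked = True
transfer_ownership(members, "ann", "cy")
```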
An admin can also add another user to the repository by typing their email address and pressing Enter. In that case, the new user is given Data manager rights.
If the user already has an ABBYY Timeline account, they simply get an email notification about access to the new repository.
If the user does not have an account, a new account is created in Recovery status and the user is sent an email with a temporary password, exactly as when a user recovers a forgotten password.
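The two invitation cases can be sketched as a single function. The data structures, the function name invite_user, and the notification labels are hypothetical; only the Data manager default, the Recovery status, and the two kinds of email come from the text.

```python
def invite_user(accounts, repository_members, email):
    """Add a user to a repository by email and return which email to send."""
    if email in accounts:
        # Existing account: only notify about the new access.
        notification = "access_notification"
    else:
        # No account yet: create one in Recovery status, as in password recovery.
        accounts[email] = {"status": "Recovery"}
        notification = "temporary_password"
    repository_members[email] = "Data manager"  # default rights for an added user
    return notification


accounts = {"ann@example.com": {"status": "Active"}}
repository_members = {}
first = invite_user(accounts, repository_members, "ann@example.com")
second = invite_user(accounts, repository_members, "new@example.com")
```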
This is where the log of all changes to the repository is recorded: new users, deleted users, and tables added to or removed from the repository are all documented here.