scistag.filestag.file_source_zip.FileSourceZip¶
- class FileSourceZip(source, **params)[source]¶
Bases: FileSource
FileSource implementation for processing zip archives, stored either locally or in the cloud.
All you have to do is provide a zip file’s filename, a bytes object containing a zip file, or an already opened zip archive. You can then easily iterate through all files, or all files of a certain type, via
for cur_file in FileSourceZip("MyZipFile", mask="*.png"): ...
- Parameters
source (str | bytes | zipfile.ZipFile) – The data source. Either a string (pointing to a filename, a URL or another FileStag compatible protocol), the data of a zip archive or an already opened archive.
params – Additional parameters. See FileSource for the full parameter list supported by FileSources.
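As a rough illustration of the iteration pattern described above, the following sketch uses only the Python standard library. It is not FileStag's implementation: `iter_zip_files` and its `mask` argument are hypothetical names chosen for this example, and it merely accepts the same three kinds of source (filename, bytes, opened archive).

```python
import fnmatch
import io
import zipfile


def iter_zip_files(source, mask="*"):
    """Yield (name, data) for every archive member whose name matches mask.

    Hypothetical helper, not part of FileStag. `source` may be a filename,
    the raw bytes of a zip archive, or an already opened zipfile.ZipFile.
    """
    if isinstance(source, zipfile.ZipFile):
        archive, owns_archive = source, False
    elif isinstance(source, bytes):
        archive, owns_archive = zipfile.ZipFile(io.BytesIO(source)), True
    else:
        archive, owns_archive = zipfile.ZipFile(source), True
    try:
        for info in archive.infolist():
            if info.is_dir():
                continue  # skip directory entries
            if fnmatch.fnmatch(info.filename, mask):
                yield info.filename, archive.read(info.filename)
    finally:
        # Only close archives this helper opened itself.
        if owns_archive:
            archive.close()


# Build a small archive in memory so the sketch runs without files on disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr("images/a.png", b"png-a")
    demo.writestr("images/b.png", b"png-b")
    demo.writestr("readme.txt", b"hello")

png_files = dict(iter_zip_files(buffer.getvalue(), mask="*.png"))
```

Here `png_files` contains only the two `.png` members; the text file is skipped by the mask.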
Methods
close() – Closes the current file source, e.g. zip archive, streaming connection etc. if applicable.
encode_file_list() – Encodes the file list so it can be stored on disk.
exists() – Verifies if a file exists.
from_source() – Auto-detects the required FileSource implementation for a given source path.
get_file_list() – Returns the file list (if available).
get_file_list_as_df() – Returns the file list as dataframe.
get_statistics() – Returns statistics about the file source if available.
handle_fetch_file_list() – Called when the file list shall be pre-fetched.
handle_file_list_filter() – Verifies if the file is valid and shall be processed by comparing it to the file mask, the index_filter etc.
handle_get_next_filename() – Returns the filename and the file size of the next file to be processed.
handle_next() – Returns the next available element.
handle_provide_result() – Provides the file result for the current iterator index.
handle_skip_check() – Verifies if the file is valid and shall be processed by comparing it to the file mask, the index_filter etc.
load_file_list() – Tries to load the file list from file.
read_file() – Reads a file from this file source, identified by name.
reduce_file_list() – Reduces the file_list by applying all filters (index_range, max_file_count, filter_callback) in advance.
save_file_list() – Saves the file list to a file so it can be quickly restored after a restart of the application.
set_file_list() – Sets a custom file list provided by the user.
update_file_list() – Call this function if you want to manually update the file list.
Attributes
__dict__
__doc__
__module__
__weakref__ – list of weak references to the object (if defined)
source_filename – The name of the source file
source_identifier – The unique identifier
The source data stream
access_lock – Multithreading access lock
zip_archive – The zip archive which provides the file data
- _get_source_identifier()[source]¶
Has to return a unique identifier for this file source which identifies the name of this source in the cache database.
Can for example be the search path and the search mask or parts of the connection string.
- Return type
- Returns
The unique identifier
- _read_file_int(filename)[source]¶
Reads a file from this file source, identified by name.
Note: Not all FileSources support direct file access by name, so you should always prefer to iterate through a FileSource object rather than accessing single files if your FileSource can be freely configured. For example, an ImageFileSource pointing to a camera can only provide its data frame by frame, and thus image file by image file, not by name.
- close()[source]¶
Closes the current file source, e.g. zip archive, streaming connection etc. if applicable
- encode_file_list(version=-1)¶
Encodes the file list so it can be stored on disk
- exists(filename)[source]¶
Verifies if a file exists.
Note: This function may not be supported by all sources (such as streaming sources)
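The by-name access, existence check and closing behavior described above can be sketched with the standard library as follows. This is an assumption-laden illustration rather than the library's code: `LockedZipReader` is a hypothetical class, and returning `None` for unknown names is a choice made for this example only. The lock mirrors the idea of a multithreading access lock guarding a shared archive handle.

```python
import io
import threading
import zipfile


class LockedZipReader:
    """Hypothetical helper: reads archive members by name under a lock."""

    def __init__(self, data: bytes):
        self._archive = zipfile.ZipFile(io.BytesIO(data))
        self._lock = threading.Lock()
        # Cache the member names once so exists() needs no archive access.
        self._names = set(self._archive.namelist())

    def exists(self, filename: str) -> bool:
        return filename in self._names

    def read_file(self, filename: str):
        if filename not in self._names:
            return None  # assumption: unknown names yield None
        # A single ZipFile handle is shared, so reads are serialized.
        with self._lock:
            return self._archive.read(filename)

    def close(self) -> None:
        with self._lock:
            self._archive.close()


# Demo archive built in memory.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr("notes.txt", b"hello")

reader = LockedZipReader(buffer.getvalue())
content = reader.read_file("notes.txt")
missing = reader.read_file("missing.txt")
known = reader.exists("notes.txt")
reader.close()
```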
- static from_source(source, search_mask='*', search_path='', recursive=True, filter_callback=None, sorting_callback=None, index_filter=None, fetch_file_list=False, max_file_count=-1, file_list_name=None, max_web_cache_age=0.0, dont_load=False)¶
Auto-detects the required FileSource implementation for a given source path
- Parameters
source – The path you would like to iterate. The following path types are currently supported:
- /home/aDirectory: Will return a FileSourceDisk object to iterate through a directory’s content.
- /home/myZipArchive.zip: Will return a FileSourceZip object to iterate through a zip archive.
- azure://DefaultEndpointsProtocol=https;AccountName=…;AccountKey=…/container/path: Will iterate through an Azure Blob Storage.
- A bytes object: Detects the source type and opens it. At the moment only zip archive data is supported.
search_mask (str) – The file name filter mask
search_path (str) – The search path, e.g. directory name or relative path to the zip root, storage root etc.
recursive (bool) – Defines if the search shall be executed recursively. True by default.
filter_callback (FilterCallback | None) – A callback function to call for each file to verify if it shall be processed or ignored. See FilterCallback.
sorting_callback (Callable[[FileListEntry], Any] | None) –
A function to be called (and passed into sorted) to sort the file list before it is stored.
It is called for every element and has to return the sorting value, either a string, a float or another comparable data type.
Only works in combination with fetch_file_list.
index_filter (tuple[int, int] | None) –
The index filter helps to split a processing task across multiple threads, nodes and/or processes.
The first tuple element defines the total worker count, the second tuple element the current worker index (0 .. worker_count-1). For example, to process a zip archive with 4 threads in parallel, spawn 4 threads and pass (4,0) to the first, (4,1) to the second, (4,2) to the third and (4,3) to the fourth.
All four threads can then work in parallel and store their processed data in one or multiple FileSinks, which are (at least in most cases) thread-safe.
fetch_file_list (bool) –
If set to true the FileSource will try to iterate all filenames in advance.
This is recommended especially if you are using sources where it’s not guaranteed that the file names will always be provided in the same order and you intend to share a task among multiple threads to guarantee a consistent behavior.
file_list_name (str | tuple[str, int] | None) –
If provided, the file list will be stored in the given file so that the files do not need to be iterated over again on each run (which can save a lot of time).
You can either pass a string containing just the file name, or a tuple of (filename, version) so you can enforce replacing the list whenever you pass a new version number.
max_file_count (int) – The maximum number of files to process (excluding the index filter’s impact)
max_web_cache_age (float) – The count of seconds for how long files from this source may be stored and received from the cache if this source is remote, e.g. Azure, AWS.
dont_load – If set to true the iterator will not provide the file’s content but just iterate the filenames. Helpful if the consumer for example requires a path to files stored on disk.
- Return type
FileSource | None
- Returns
The FileSource implementation for your path. None if the path can not be identified.
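To make the auto-detection idea above more concrete, here is a minimal sketch of how a source could be classified. It is not the library's detection logic: `detect_source_kind` and its string labels are hypothetical, and the rules (prefix check, file extension check, directory check) are assumptions derived only from the path types listed above.

```python
import io
import os
import zipfile


def detect_source_kind(source):
    """Return a label for the kind of source, or None if it is not recognized.

    Hypothetical helper; the labels and rules are assumptions for illustration.
    """
    if isinstance(source, bytes):
        # Only zip archive data is recognized for raw bytes.
        return "zip" if zipfile.is_zipfile(io.BytesIO(source)) else None
    if isinstance(source, str):
        if source.startswith("azure://"):
            return "azure"
        if source.lower().endswith(".zip"):
            return "zip"
        if os.path.isdir(source):
            return "disk"
    return None


# Demo data: a tiny in-memory archive.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr("a.txt", b"a")
zip_bytes = buffer.getvalue()

kind_of_bytes = detect_source_kind(zip_bytes)
kind_of_zip_path = detect_source_kind("/home/myZipArchive.zip")
kind_of_garbage = detect_source_kind(b"not a zip")
```

Returning `None` for unrecognized input matches the documented behavior of returning None if the path cannot be identified.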
- get_file_list()¶
Returns the file list (if available).
Note that the file list is not available for all file sources. Pass fetch_file_list=True to the initializer of any supported FileSource to fetch the list in advance.
- Return type
FileList | None
- Returns
The list of filenames and their size (so far known).
- get_file_list_as_df()¶
Returns the file list as dataframe
- Return type
DataFrame
- Returns
The file list
- get_statistics()¶
Returns statistics about the file source if available.
Requires a valid file list, see get_file_list().
- Return type
dict | None
- Returns
Dictionary with statistics about file types, total size etc.
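What such a statistics dictionary might contain can be sketched with the standard library. The key names (`file_count`, `total_size`, `extensions`) and the helper `zip_statistics` are hypothetical; the actual dictionary layout is not specified here.

```python
import io
import os
import zipfile
from collections import Counter


def zip_statistics(archive):
    """Collect simple statistics about a zip archive's members.

    Hypothetical helper; the returned keys are assumptions for illustration.
    """
    infos = [info for info in archive.infolist() if not info.is_dir()]
    extensions = Counter(
        os.path.splitext(info.filename)[1].lower() for info in infos
    )
    return {
        "file_count": len(infos),
        "total_size": sum(info.file_size for info in infos),  # uncompressed bytes
        "extensions": dict(extensions),
    }


# Demo archive built in memory.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr("a.png", b"12345")
    demo.writestr("b.png", b"123")
    demo.writestr("readme.txt", b"hi")

with zipfile.ZipFile(io.BytesIO(buffer.getvalue())) as archive:
    stats = zip_statistics(archive)
```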
- handle_fetch_file_list(force=False)[source]¶
Called when the file list shall be pre-fetched.
If your custom FileSource is able to do so, populate self.file_list with a sorted list of all available files and, instead of iterating the files live, always access the matching file list entry using self.file_list[file_index].
- Parameters
force (bool) – Enforce an update of the file list, even if it was already created before
- handle_file_list_filter(filename)¶
Verifies if the file is valid and shall be processed by comparing it to the file mask, the index_filter etc.
Increases the file_index upon failure. Does NOT increase it upon success (as provide_result() will do so).
- handle_get_next_filename(iterator)[source]¶
Returns the filename and the file size of the next file to be processed.
Override this method for your own custom file iterator.
- Parameters
iterator (FileSourceIterator) – The file iterator object
- Return type
- Returns
Name and size of the next element as tuple
- handle_next(iterator)¶
Returns the next available element
- Parameters
iterator (FileSourceIterator) – The iterator object which keeps track of the current processing
- Return type
FileSourceElement | None
- Returns
The next file object if available
- handle_provide_result(iterator, filename, data)¶
Provides the file result for the current iterator index
- Parameters
iterator (FileSourceIterator) – The iterator handle
filename (str) – The name of the file to be stored
data (bytes) – The file data
- Return type
- handle_skip_check(file_info)¶
Verifies if the file is valid and shall be processed by comparing it to the file mask, the index_filter etc.
Increases the file_index upon failure. Does NOT increase it upon success (as provide_result() will do so).
- Parameters
file_info (FileIterationData) – Information about the current file
- Return type
str | None
- Returns
A valid filename if the file shall be processed, None otherwise.
- load_file_list(source, version=-1)¶
Tries to load the file list from file
- Parameters
- Return type
- Returns
True if a valid list could be loaded.
- read_file(filename)¶
Reads a file from this file source, identified by name.
Note: Not all FileSources support direct file access by name, so you should always prefer to iterate through a FileSource object rather than accessing single files if your FileSource can be freely configured. For example, an ImageFileSource pointing to a camera can only provide its data frame by frame, and thus image file by image file, not by name.
- reduce_file_list()¶
Reduces the file_list by applying all filters (index_range, max_file_count, filter_callback) in advance. Requires the source to be initialized with fetch_file_list and thus requires a non-streaming file source where the full file list is known in advance.
This way you know in advance which files will really be processed with your current filter settings. So that the filters are not applied twice, this function also disables all callbacks and filter variables after its execution.
- Return type
list[FileListEntry] | None
- Returns
Returns the reduced file list
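The idea of applying all filters to a known file list in advance can be sketched as below. Everything here is an assumption for illustration: `reduce_names` is a hypothetical function, the callback receives a plain filename rather than the library's callback arguments, and the order in which the filters are applied (callback, then file count cap, then worker share) is a guess, not documented behavior.

```python
def reduce_names(names, filter_callback=None, max_file_count=-1, index_filter=None):
    """Apply filters to a list of names up front and return the survivors.

    Hypothetical helper; filter order and callback signature are assumptions.
    """
    # 1. Keep only names accepted by the callback (if any).
    kept = [n for n in names if filter_callback is None or filter_callback(n)]
    # 2. Cap the overall number of files; -1 means no limit.
    if max_file_count >= 0:
        kept = kept[:max_file_count]
    # 3. Keep this worker's share: (worker_count, worker_index).
    if index_filter is not None:
        worker_count, worker_index = index_filter
        kept = [n for i, n in enumerate(kept) if i % worker_count == worker_index]
    return kept


names = ["a.png", "b.txt", "c.png", "d.png", "e.png"]
reduced = reduce_names(
    names,
    filter_callback=lambda name: name.endswith(".png"),
    max_file_count=3,
    index_filter=(2, 0),
)
```

Because the list is reduced once, later iteration does not need to evaluate the filters again, which is the reason the documented method disables them afterwards.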
- save_file_list(target, version=-1)¶
Saves the file list to a file so it can be quickly restored after a restart of the application.
- set_file_list(new_list)¶
Sets a custom file list provided by the user.
Helpful for large jobs where the total file list is split into several working packages in advance and the shares need to be customized.
- Parameters
new_list (list[str] | list[FileListEntry]) – The new list to be assigned. Either a list of FileListEntry objects with all details or a list of filenames
- update_file_list(new_list)¶
Call this function if you want to manually update the file list.
Updates the internal search index and other helper variables.
- Parameters
new_list (list[FileListEntry]) – The new list
- _file_list: FileList | None¶
A sorted list of all files (if available, e.g. by setting fetch_file_list=True).
Note that settings such as index_filter and max_file_count have no effect on the file_list by default. You can, however, explicitly call the method reduce_file_list(), which will execute all filters in advance to provide you with the final file_list and will disable these variables afterwards.
- _file_list_name¶
The name of the file from which the file list shall be loaded
- _file_list_version: int¶
The version of the file list to assume. If it does not match the stored version, the stored list will be replaced
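The versioning idea can be illustrated with a small, self-contained sketch. The JSON layout and the helper names `save_list` and `load_list` are hypothetical and say nothing about the on-disk format the library actually uses; the point is only that a stored list is ignored when its version differs from the requested one.

```python
import json
import os
import tempfile


def save_list(path, names, version=-1):
    """Store the names together with a version number (hypothetical format)."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump({"version": version, "files": names}, handle)


def load_list(path, version=-1):
    """Return the stored names, or None if the file is missing or outdated."""
    if not os.path.exists(path):
        return None
    with open(path, "r", encoding="utf-8") as handle:
        stored = json.load(handle)
    if stored.get("version") != version:
        return None  # version mismatch: the caller should rebuild and save again
    return stored["files"]


with tempfile.TemporaryDirectory() as folder:
    cache_path = os.path.join(folder, "file_list.json")
    save_list(cache_path, ["a.png", "b.png"], version=1)
    same_version = load_list(cache_path, version=1)
    newer_version = load_list(cache_path, version=2)
```

Bumping the version number is therefore a simple way to force a fresh scan after the underlying data has changed.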
- access_lock¶
Multithreading access lock
- dont_load¶
If set to true, the iterator (for element in FileSource) will not fetch the file’s content but just iterate through its filenames
- file_set¶
A set containing all known files. Only valid if file_list is available too
- filter_callback¶
The filter function which will be called for each file to verify if it shall be processed
- index_filter: tuple[int, int] | None¶
The index filter helps to split a processing task across multiple threads, nodes and/or processes.
See initializer parameter.
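How a (worker_count, worker_index) tuple can split a file list is sketched below. The modulo rule is an assumption chosen for illustration, not confirmed library behavior, and `belongs_to_worker` is a hypothetical helper.

```python
def belongs_to_worker(file_index, index_filter):
    """Return True if the file at file_index is assigned to this worker.

    index_filter is (worker_count, worker_index). The modulo rule is an
    assumption made for this sketch.
    """
    worker_count, worker_index = index_filter
    return file_index % worker_count == worker_index


filenames = [f"img_{i:02d}.png" for i in range(10)]

# One share per worker, using the tuples (4,0), (4,1), (4,2) and (4,3).
shares = [
    [name for i, name in enumerate(filenames) if belongs_to_worker(i, (4, w))]
    for w in range(4)
]
```

Every file lands in exactly one share, but only if all workers see the files in the same order, which is why fetching the file list in advance is recommended when a task is shared among multiple threads.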
- is_closed¶
Defines if this file source was closed
- max_file_count¶
The maximum number of files to process (excluding the impact of index_filter)
- output_filename_list: list[str] | None¶
If defined it provides the output filenames for every file in self.file_list.
- recursive¶
Defines if the search shall be executed recursively
- search_mask¶
The search mask to match the filenames against before they are returned
- search_path¶
The path to search within, e.g. a file path
- source_filename¶
The name of the source file
- source_identifier¶
The unique identifier
- user_data¶
The user data for further customization, e.g. of the filter callback
- zip_archive: zipfile.ZipFile¶
The zip archive which provides the file data