scistag.filestag.file_source_zip.FileSourceZip¶

class FileSourceZip(source, **params)[source]¶

Bases: FileSource

FileSource implementation for processing zip archives, stored either locally or in the cloud.

Provide a zip file’s filename, a bytes object containing a zip archive or an already opened zip archive, and you can iterate through all files or only files of a certain type via

for cur_file in FileSourceZip("MyZipFile", mask="*.png"): ...
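To make the iteration pattern concrete, the following sketch reproduces it with the standard library only. It is not FileStag's implementation: `iter_zip_files` is a hypothetical helper, and the demo archive is built in memory so that the snippet runs on its own.

```python
import fnmatch
import io
import zipfile


def iter_zip_files(source, mask="*"):
    """Yield (name, data) for each archive member whose name matches mask.

    Hypothetical stand-in for iterating a FileSourceZip. `source` may be a
    filename, the bytes of a zip archive or an open zipfile.ZipFile.
    """
    if isinstance(source, zipfile.ZipFile):
        archive = source
    elif isinstance(source, bytes):
        archive = zipfile.ZipFile(io.BytesIO(source))
    else:
        archive = zipfile.ZipFile(source)
    for info in archive.infolist():
        if info.is_dir():
            continue  # directory entries carry no file data
        if fnmatch.fnmatch(info.filename, mask):
            yield info.filename, archive.read(info.filename)


# Build a small archive in memory for demonstration purposes.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr("a.png", b"png-data")
    demo.writestr("notes.txt", b"text")
    demo.writestr("images/b.png", b"more-png")

png_files = dict(iter_zip_files(buffer.getvalue(), mask="*.png"))
```

Note that `fnmatch` lets `*` match path separators as well, so the mask also selects files in subfolders. Whether FileSourceZip treats masks the same way is not stated here.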

Parameters

source (str | bytes | zipfile.ZipFile) – The data source. Either a string (pointing to a filename, a URL or another FileStag compatible protocol), the data of a zip archive or an already opened archive.
params – Additional parameters. See FileSource for the full parameter list supported by FileSources.

Methods

`close`	Closes the current file source, e.g. zip archive, streaming connection etc., if applicable
`encode_file_list`	Encodes the file list so it can be stored on disk
`exists`	Verifies if a file exists.
`from_source`	Auto-detects the required FileSource implementation for a given source path
`get_file_list`	Returns the file list (if available).
`get_file_list_as_df`	Returns the file list as dataframe
`get_statistics`	Returns statistics about the file source if available.
`handle_fetch_file_list`	Called when the file list shall be pre-fetched.
`handle_file_list_filter`	Verifies if the file is valid and shall be processed by comparing it to the file mask, the index_filter etc.
`handle_get_next_filename`	Returns the filename and the file size of the next file to be processed.
`handle_next`	Returns the next available element
`handle_provide_result`	Provides the file result for the current iterator index
`handle_skip_check`	Verifies if the file is valid and shall be processed by comparing it to the file mask, the index_filter etc.
`load_file_list`	Tries to load the file list from file
`read_file`	Reads a file from this file source, identified by name.
`reduce_file_list`	Reduces the `file_list` by applying all filters (index_range, max_file_count, filter_callback) in advance.
`save_file_list`	Saves the file list to a file so it can be quickly restored after a restart of the application.
`set_file_list`	Sets a custom file list provided by the user.
`update_file_list`	Call this function if you want to manually update the file list.

Attributes

`__dict__`
`__doc__`
`__module__`
`__weakref__`	list of weak references to the object (if defined)
`source_filename`	The name of the source file
`source_identifier`	The unique identifier
`source_data`	The source data stream
`access_lock`	Multithreading access lock
`zip_archive`	The zip archive which provides the file data

_get_source_identifier()[source]¶

Has to return a unique identifier for this file source which identifies the name of this source in the cache database.

Can for example be the search path and the search mask or parts of the connection string.

Return type: str
Returns: The unique identifier

_read_file_int(filename)[source]¶

Reads a file from this file source, identified by name.

Note: Not all FileSources support direct file access by name, so you should always prefer to just iterate through a FileSource object rather than accessing single files if your FileSource can be freely configured. For example, an ImageFileSource pointing to a camera can only provide its data frame by frame - and thus image file by image file - and not by name.

Parameters: filename (str) – The name of the file to read
Return type: bytes | None
Returns: The file’s content on success, None otherwise
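The contract above (bytes on success, None otherwise) can be sketched with the standard library. This is an illustration under assumptions, not the actual FileSourceZip code: `ZipReader` is a hypothetical class, and guarding the read with a lock simply mirrors the documented `access_lock` attribute.

```python
import io
import threading
import zipfile
from typing import Optional


class ZipReader:
    """Hypothetical minimal reader; not the FileSourceZip implementation."""

    def __init__(self, data: bytes):
        self.zip_archive = zipfile.ZipFile(io.BytesIO(data))
        self.access_lock = threading.RLock()

    def read_file(self, filename: str) -> Optional[bytes]:
        # Serialize access to the shared archive object.
        with self.access_lock:
            try:
                return self.zip_archive.read(filename)
            except KeyError:
                # zipfile raises KeyError for unknown member names.
                return None


# Build a tiny archive in memory so the example is self-contained.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr("hello.txt", b"hello")

reader = ZipReader(buffer.getvalue())
content = reader.read_file("hello.txt")
missing = reader.read_file("missing.txt")
```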

close()[source]¶: Closes the current file source, e.g. zip archive, streaming connection etc. if applicable

encode_file_list(version=-1)¶

Encodes the file list so it can be stored on disk

Parameters

version (int) –

The user-defined version number. It can be passed to enforce updating the list whenever this number is changed.

If -1 is passed the version is ignored.

Return type

bytes

Returns

The encoded file list

exists(filename)[source]¶

Verifies if a file exists.

Note: This function may not be supported by all sources (such as streaming sources)

Parameters: filename (str) – The file to look for
Return type: bool
Returns: True if the file exists

static from_source(source, search_mask='*', search_path='', recursive=True, filter_callback=None, sorting_callback=None, index_filter=None, fetch_file_list=False, max_file_count=-1, file_list_name=None, max_web_cache_age=0.0, dont_load=False)¶

Auto-detects the required FileSource implementation for a given source path

Parameters

source (str | bytes) –
The path you would like to iterate. The following path types are currently supported:
- /home/aDirectory: Will return a FileSourceDisk object to
  iterate through a directory’s content
- /home/myZipArchive.zip: Will return a FileSourceZip object to
  iterate through a zip archive
- azure://DefaultEndpointsProtocol=https;AccountName=…;AccountKey=…/container/path:
  Will iterate through an Azure Blob Storage.
- A bytes object: Detects the source type and opens it. At the
  moment only zip archive data is supported.
search_mask (str) – The file name filter mask
search_path (str) – The search path, e.g. directory name or relative path to the zip root, storage root etc.
recursive (bool) – Defines if the search shall be executed recursively. True by default.
filter_callback (FilterCallback | None) – A callback function to call for each file to verify if it shall be processed or ignored. See FilterCallback
sorting_callback (Callable[[FileListEntry], Any] | None) –
A function to be called (and passed into sorted) to sort the file list before it is stored.

Is called for every element and has to return the sorting value, either a string, a float or another comparable data type.

Only works in combination with fetch_file_list.
index_filter (tuple[int, int] | None) –
The index filter helps split a processing task across multiple threads, nodes and/or processes.

The first tuple element defines the total worker count, the second tuple element the current worker index (0 .. worker_count-1). For example, to process a zip archive with 4 threads in parallel, spawn 4 threads and pass (4,0) to the first, (4,1) to the second, (4,2) to the third and (4,3) to the fourth.

All four threads can then work in parallel and store their processed data in one or multiple FileSinks, which are (at least in most cases) multi-thread safe.
fetch_file_list (bool) –
If set to true the FileSource will try to iterate all filenames in advance.

This is recommended especially if you are using sources where it’s not guaranteed that the file names will always be provided in the same order and you intend to share a task among multiple threads to guarantee a consistent behavior.
file_list_name (str | tuple[str, int] | None) –
If provided, the file list will be stored in the given file so that the files do not need to be iterated over and over again on each run (which can save a lot of time).

You can either pass a string containing just the file name, or a tuple of (filename, version) so you can enforce replacing the list whenever you pass a new version number.
max_file_count (int) – The maximum number of files to process (excluding the index filter’s impact)
max_web_cache_age (float) – The number of seconds for which files from this source may be stored in and retrieved from the cache if this source is remote, e.g. Azure, AWS.
dont_load – If set to true the iterator will not provide the file’s content but just iterate the filenames. Helpful if the consumer for example requires a path to files stored on disk.

Return type

FileSource | None

Returns

The FileSource implementation for your path. None if the path can not be identified.
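As an illustration of the sorting_callback parameter described above, the sketch below passes a key function to Python's built-in `sorted`. `Entry` is a hypothetical stand-in for FileListEntry, and its field names are assumptions made for this example only.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical stand-in for FileListEntry; field names are assumed."""
    filename: str
    file_size: int


def by_size_then_name(entry: Entry):
    # Return the sorting value; tuples compare element by element.
    return (entry.file_size, entry.filename)


entries = [Entry("b.png", 200), Entry("a.png", 200), Entry("c.png", 50)]
sorted_entries = sorted(entries, key=by_size_then_name)
sorted_names = [entry.filename for entry in sorted_entries]
```

The callback is invoked once per element and only has to return a value that can be compared with the values returned for the other elements.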

get_file_list()¶

Returns the file list (if available).

Note that the file list is not available for all file sources. Pass fetch_file_list=True to the initializer of a supported FileSource to fetch the list in advance.

Return type: FileList | None
Returns: The list of filenames and their size (so far known).

get_file_list_as_df()¶

Returns the file list as dataframe

Return type: DataFrame
Returns: The file list

get_statistics()¶

Returns statistics about the file source if available.

Requires a valid file list, see get_file_list().

Return type: dict | None
Returns: Dictionary with statistics about file types, total size etc.
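A rough idea of what such a statistics dictionary could contain is sketched below. The keys (`file_count`, `total_size`, `extensions`) and the helper `collect_statistics` are hypothetical; the real structure returned by get_statistics() is not specified here.

```python
import os
from collections import Counter


def collect_statistics(file_list):
    """Summarize a list of (filename, size) pairs; keys are illustrative."""
    if not file_list:
        return None  # no file list available, so no statistics
    extensions = Counter(
        os.path.splitext(name)[1].lower() for name, _ in file_list
    )
    return {
        "file_count": len(file_list),
        "total_size": sum(size for _, size in file_list),
        "extensions": dict(extensions),
    }


stats = collect_statistics([("a.png", 100), ("b.PNG", 50), ("notes.txt", 10)])
```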

handle_fetch_file_list(force=False)[source]¶

Called when the file list shall be pre-fetched.

If your custom FileSource is able to do so, populate self.file_list with a sorted list of all available files and, instead of iterating the files live, always access the matching file list entry using self.file_list[file_index].

Parameters: force (bool) – Enforce an update of the file list, even if it was created before already

handle_file_list_filter(filename)¶

Verifies if the file is valid and shall be processed by comparing it to the file mask, the index_filter etc.

Increases the file_index upon failure. Does NOT increase it upon success (as provide_result() will do so).

Parameters: filename (str) – The file’s name
Return type: bool
Returns: True if the file shall be processed, False otherwise.

handle_get_next_filename(iterator)[source]¶

Returns the filename and the file size of the next file to be processed.

Overwrite this method for your own, custom File iterator.

Parameters: iterator (FileSourceIterator) – The file iterator object
Return type: tuple[str, int] | None
Returns: Name and size of the next element as tuple

handle_next(iterator)¶

Returns the next available element

Parameters: iterator (FileSourceIterator) – The iterator object which keeps track of the current processing
Return type: FileSourceElement | None
Returns: The next file object if available

handle_provide_result(iterator, filename, data)¶

Provides the file result for the current iterator index

Parameters

iterator (FileSourceIterator) – The iterator handle
filename (str) – The name of the file to be stored
data (bytes) – The file data

Return type

FileSourceElement

handle_skip_check(file_info)¶

Verifies if the file is valid and shall be processed by comparing it to the file mask, the index_filter etc.

Increases the file_index upon failure. Does NOT increase it upon success (as provide_result() will do so).

Parameters: file_info (FileIterationData) – Information about the current file
Return type: str | None
Returns: A valid filename if the file shall be processed, None otherwise.

load_file_list(source, version=-1)¶

Tries to load the file list from file

Parameters

source (bytes | str) – The file list source. Any FileStag compatible data source.
version (int) –
The user-defined version number. It can be passed to enforce updating the list whenever this number is changed.

If -1 is passed the version is ignored.

Return type

bool

Returns

True if a valid list could be loaded.
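The version check described above can be sketched as follows. The JSON encoding and both helper functions are assumptions made for illustration; the format FileStag actually uses to store file lists is not documented in this section.

```python
import json


def encode_list(entries, version=-1) -> bytes:
    """Serialize (filename, size) pairs together with a version number."""
    payload = {"version": version, "files": entries}
    return json.dumps(payload).encode("utf-8")


def decode_list(data: bytes, version=-1):
    """Return the stored entries, or None if the version does not match."""
    payload = json.loads(data.decode("utf-8"))
    if version != -1 and payload["version"] != version:
        return None  # stale list, caller should rebuild it
    # JSON turns tuples into lists, so convert them back.
    return [tuple(entry) for entry in payload["files"]]


entries = [("a.png", 10), ("b.png", 20)]
encoded = encode_list(entries, version=1)

same_version = decode_list(encoded, version=1)
new_version = decode_list(encoded, version=2)
ignored_version = decode_list(encoded)
```

Passing a new version number invalidates the stored list, while -1 accepts whatever is stored.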

read_file(filename)¶

Reads a file from this file source, identified by name.

Note: Not all FileSources support direct file access by name, so you should always prefer to just iterate through a FileSource object rather than accessing single files if your FileSource can be freely configured. For example, an ImageFileSource pointing to a camera can only provide its data frame by frame - and thus image file by image file - and not by name.

Parameters: filename (str) – The name of the file to read
Return type: bytes | None
Returns: The file’s content on success, None otherwise

reduce_file_list()¶

Reduces the file_list by applying all filters (index_range, max_file_count, filter_callback) in advance. Requires the source being initialized with fetch_file_list in advance and thus requires a non-streaming file source where the full file list is known in advance.

This way you know in advance which files (after all the filters) are really getting processed with your current filtering settings. So that the filters are not applied twice, this function also disables all callbacks and filter variables after its execution.

Return type: list[FileListEntry] | None
Returns: Returns the reduced file list
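The idea of applying all filters up front can be sketched as below. The order in which the filters run, the callback signature and the function `reduce_names` are assumptions for illustration; the library may combine the filters differently.

```python
def reduce_names(file_list, index_filter=None, max_file_count=-1,
                 filter_callback=None):
    """Apply the filters once, in an assumed order, and return the result."""
    # 1. User callback decides per file whether it is kept.
    names = [name for name in file_list
             if filter_callback is None or filter_callback(name)]
    # 2. Limit the total count (-1 means no limit).
    if max_file_count >= 0:
        names = names[:max_file_count]
    # 3. Keep only this worker's share of the remaining files.
    if index_filter is not None:
        worker_count, worker_index = index_filter
        names = names[worker_index::worker_count]
    return names


files = ["a.png", "b.txt", "c.png", "d.png", "e.png"]
reduced = reduce_names(
    files,
    index_filter=(2, 0),
    max_file_count=3,
    filter_callback=lambda name: name.endswith(".png"),
)
```

Once such a reduced list exists, the filters no longer need to run during iteration, which is why the documented method disables them afterwards.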

save_file_list(target, version=-1)¶

Saves the file list to a file so it can be quickly restored after a restart of the application.

Parameters

target (str) – The FileStag compatible file target, e.g. a local file name
version (int) –
The user-defined version number. It can be passed to enforce updating the list whenever this number is changed.

If -1 is passed the version is ignored.

set_file_list(new_list)¶

Sets a custom file list provided by the user.

Helpful for large jobs where the total file list is split into several working packages in advance and the shares need to be customized.

Parameters: new_list (list[str] | list[FileListEntry]) – The new list to be assigned. Either a list of “FileListEntry”s with all details or a list of filenames

update_file_list(new_list)¶

Call this function if you want to manually update the file list.

Updates the internal search index and other helper variables.

Parameters: new_list (list[FileListEntry]) – The new list

_file_list: FileList | None¶

A sorted list of all files (if available, e.g. by setting fetch_file_list=True).

Note that settings such as index_filter and max_file_count have no effect on the file_list by default. You can, however, explicitly call the method reduce_file_list(), which will execute all filters in advance to provide you the final file_list and will disable these variables afterwards.

_file_list_name¶: The name of the file from which the file list shall be loaded

_file_list_version: int¶: The version of the file list to assume. If it mismatches the stored version it will be replaced

_statistics: dict | None¶: The statistics, only available when all files were iterated

access_lock¶: Multithreading access lock

dont_load¶: If set to true, the iterator (for element in FileSource) will not fetch the file’s content but just iterate through its filenames

file_set¶: A set containing all known files. Only valid if file_list is available too

filter_callback¶: The filter function which will be called for each file to verify if it shall be processed

index_filter: tuple[int, int] | None¶

The index filter helps split a processing task across multiple threads, nodes and/or processes.

See initializer parameter.
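One plausible way such a (worker_count, worker_index) tuple can partition a file list is a round-robin split by position. The modulo rule and the helper `select_for_worker` are assumptions for illustration; only the meaning of the two tuple elements is taken from the parameter description.

```python
def select_for_worker(filenames, index_filter):
    """Return the files assigned to one worker under a round-robin split."""
    worker_count, worker_index = index_filter
    return [
        name
        for position, name in enumerate(filenames)
        if position % worker_count == worker_index
    ]


filenames = [f"img_{i}.png" for i in range(10)]
# One share per worker: (4, 0), (4, 1), (4, 2) and (4, 3).
shares = [select_for_worker(filenames, (4, worker)) for worker in range(4)]
```

Every file lands in exactly one share, so the workers never process the same file twice. This only holds if all workers see the files in the same order, which is the reason fetch_file_list is recommended for such setups.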

is_closed¶: Defines if this file source was closed

max_file_count¶: The maximum number of files to process (excluding the impact of index_filter)

output_filename_list: list[str] | None¶: If defined it provides the output filenames for every file in self.file_list.

recursive¶: Defines if the search shall be executed recursively

search_mask¶: The search mask to match the filenames against before they are returned

search_path¶: The path to search within, e.g. a file path

source_data: bytes | None¶: The source data stream

source_filename¶: The name of the source file

source_identifier¶: The unique identifier

user_data¶: The user data for further customization, e.g. of the filter callback

zip_archive: zipfile.ZipFile¶: The zip archive which provides the file data