scistag.filestag.file_source_zip.FileSourceZip

class FileSourceZip(source, **params)

Bases: FileSource

FileSource implementation for processing zip archives, stored either locally or in the cloud.

Provide a zip file’s filename, a bytes object containing a zip archive, or an already opened zip archive, and you can iterate through all files, or only files of a certain type, via

for cur_file in FileSourceZip("MyZipFile", mask="*.png"): ...

Parameters
  • source (str | bytes | zipfile.ZipFile) – The data source. Either a string (pointing to a filename, a URL or another FileStag compatible protocol), the data of a zip archive or an already opened archive.

  • params – Additional parameters. See FileSource for the full parameter list supported by FileSources.
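As a rough illustration of what iterating a zip source with a file mask amounts to, the sketch below uses only Python's standard zipfile and fnmatch modules. It is not FileSourceZip's implementation; the helpers build_demo_zip and iter_zip are hypothetical names, and matching the mask against the base name is an assumption.

```python
import fnmatch
import io
import zipfile


def build_demo_zip() -> bytes:
    """Create a small in-memory zip archive for demonstration purposes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("images/a.png", b"png-a")
        archive.writestr("images/b.png", b"png-b")
        archive.writestr("notes.txt", b"hello")
    return buffer.getvalue()


def iter_zip(source, mask="*"):
    """Yield (filename, data) for every archive member matching mask.

    source may be a filename, the raw bytes of an archive or an already
    opened zipfile.ZipFile, mirroring the three input kinds listed above.
    """
    if isinstance(source, zipfile.ZipFile):
        archive, owns_archive = source, False
    elif isinstance(source, bytes):
        archive, owns_archive = zipfile.ZipFile(io.BytesIO(source)), True
    else:
        archive, owns_archive = zipfile.ZipFile(source), True
    try:
        for info in archive.infolist():
            if info.is_dir():
                continue
            # Assumption: the mask is compared to the base name only,
            # so "*.png" also finds files in sub directories.
            base_name = info.filename.rsplit("/", 1)[-1]
            if fnmatch.fnmatch(base_name, mask):
                yield info.filename, archive.read(info)
    finally:
        # Only close archives this function opened itself.
        if owns_archive:
            archive.close()


png_files = [name for name, _ in iter_zip(build_demo_zip(), mask="*.png")]
```

An archive that was opened by the caller is left open, so it can be reused after the iteration.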

Methods

close

Closes the current file source, e.g. zip archive, streaming connection etc. if applicable

encode_file_list

Encodes the file list so it can be stored on disk

exists

Verifies if a file exists.

from_source

Auto-detects the required FileSource implementation for a given source path

get_file_list

Returns the file list (if available).

get_file_list_as_df

Returns the file list as dataframe

get_statistics

Returns statistics about the file source if available.

handle_fetch_file_list

Called when the file list shall be pre-fetched.

handle_file_list_filter

Verifies if the file is valid and shall be processed by comparing it to the file mask, the index_filter etc.

handle_get_next_filename

Returns the filename and the file size of the next file to be processed.

handle_next

Returns the next available element

handle_provide_result

Provides the file result for the current iterator index

handle_skip_check

Verifies if the file is valid and shall be processed by comparing it to the file mask, the index_filter etc.

load_file_list

Tries to load the file list from file

read_file

Reads a file from this file source, identified by name.

reduce_file_list

Reduces the file_list by applying all filters (index_filter, max_file_count, filter_callback) in advance.

save_file_list

Saves the file list to a file so it can be quickly restored after a restart of the application.

set_file_list

Sets a custom file list provided by the user.

update_file_list

Call this function if you want to manually update the file list.

Attributes

__dict__

__doc__

__module__

__weakref__

list of weak references to the object (if defined)

source_filename

The name of the source file

source_identifier

The unique identifier

source_data

The source data stream

access_lock

Multithreading access lock

zip_archive

The zip archive which provides the file data

_get_source_identifier()

Has to return a unique identifier for this file source which identifies the name of this source in the cache database.

Can for example be the search path and the search mask or parts of the connection string.

Return type

str

Returns

The unique identifier

_read_file_int(filename)

Reads a file from this file source, identified by name.

Note: Not all FileSources support direct file access by name, so you should always prefer to just iterate through a FileSource object rather than accessing single files if your FileSource can be freely configured. For example, an ImageFileSource pointing to a camera can only provide its data frame by frame, and thus image file by image file, and not by name.

Parameters

filename (str) – The name of the file to read

Return type

bytes | None

Returns

The file’s content on success, None otherwise

close()

Closes the current file source, e.g. zip archive, streaming connection etc. if applicable

encode_file_list(version=-1)

Encodes the file list so it can be stored on disk

Parameters

version (int) –

The user defined version number. It can be passed to enforce updating the list whenever this number is changed.

If -1 is passed the version is ignored.

Return type

bytes

Returns

The encoded file list

exists(filename)

Verifies if a file exists.

Note: This function may not be supported by all sources (such as streaming sources)

Parameters

filename (str) – The file to look for

Return type

bool

Returns

True if the file exists

static from_source(source, search_mask='*', search_path='', recursive=True, filter_callback=None, sorting_callback=None, index_filter=None, fetch_file_list=False, max_file_count=-1, file_list_name=None, max_web_cache_age=0.0, dont_load=False)

Auto-detects the required FileSource implementation for a given source path

Parameters
  • source (str | bytes) –

    The path you would like to iterate. The following path types are currently supported:

    • /home/aDirectory: Will return a FileSourceDisk object to iterate through a directory’s content

    • /home/myZipArchive.zip: Will return a FileSourceZip object to iterate through a zip archive

    • azure://DefaultEndpointsProtocol=https;AccountName=…;AccountKey=…/container/path: Will iterate through an Azure Blob Storage.

    • A bytes object: Detects the source type and opens it. At the moment only zip archive data is supported.

  • search_mask (str) – The file name filter mask

  • search_path (str) – The search path, e.g. directory name or relative path to the zip root, storage root etc.

  • recursive (bool) – Defines if the search shall be executed recursive. True by default.

  • filter_callback (FilterCallback | None) – A callback function to call for each file to verify if it shall be processed or ignored. See FilterCallback

  • sorting_callback (Callable[[FileListEntry], Any] | None) –

    A function to be called (and passed into sorted) to sort the file list before it is stored.

    Is called for every element and has to return the sorting value, either a string, float or another size comparable data type.

    Does only work in combination with fetch_file_list.

  • index_filter (tuple[int, int] | None) –

    The index filter helps split a processing task across multiple threads, nodes and/or processes.

    The first tuple element defines the total worker count, the second tuple element the current worker index (0 .. worker_count-1). For example, to process a zip archive with 4 threads in parallel, spawn 4 threads and pass (4,0) to the first, (4,1) to the second, (4,2) to the third and (4,3) to the fourth.

    All four threads can then work in parallel and store their processed data in one or multiple FileSinks, which are (at least in most cases) multi-thread safe.

  • fetch_file_list (bool) –

    If set to true the FileSource will try to iterate all filenames in advance.

    This is recommended especially if you are using sources where it’s not guaranteed that the file names will always be provided in the same order and you intend to share a task among multiple threads to guarantee a consistent behavior.

  • file_list_name (str | tuple[str, int] | None) –

    If provided, the file list will be stored in the given file so that the files do not need to be iterated over and over again each run (which can save a lot of time).

    You can either pass a string containing just the file name or a tuple of (filename, version) so you can enforce replacing the list whenever you pass a new version number.

  • max_file_count (int) – The maximum number of files to process (excluding the index filter’s impact)

  • max_web_cache_age (float) – The number of seconds for which files from this source may be stored in and retrieved from the cache if this source is remote, e.g. Azure, AWS.

  • dont_load – If set to true the iterator will not provide the file’s content but just iterate the filenames. Helpful if the consumer for example requires a path to files stored on disk.

Return type

FileSource | None

Returns

The FileSource implementation for your path. None if the path can not be identified.
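The sorting_callback parameter described above returns one sort value per file list entry and is handed to Python's built-in sorted. A minimal sketch of that mechanism follows; the Entry dataclass is a hypothetical stand-in for FileListEntry, and its field names filename and file_size are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical stand-in for a file list entry (field names assumed)."""
    filename: str
    file_size: int


def sort_by_size_then_name(entry: Entry):
    """Return the sort value for one entry: size first, then name."""
    return (entry.file_size, entry.filename)


entries = [Entry("b.png", 20), Entry("a.png", 20), Entry("c.txt", 5)]

# The callback is used as the key function of sorted().
ordered = sorted(entries, key=sort_by_size_then_name)
ordered_names = [entry.filename for entry in ordered]
```

Returning a tuple keeps the order deterministic when two files share the same size.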

get_file_list()

Returns the file list (if available).

Note that the file list is not available for all file sources. Pass fetch_file_list=True to the initializer of all supported FileSources to fetch the list in advance.

Return type

FileList | None

Returns

The list of filenames and their size (so far known).

get_file_list_as_df()

Returns the file list as dataframe

Return type

DataFrame

Returns

The file list

get_statistics()

Returns statistics about the file source if available.

Requires a valid file list, see get_file_list().

Return type

dict | None

Returns

Dictionary with statistics about file types, total size etc.
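To give an idea of what such a statistics dictionary could contain, here is a small sketch that derives file type counts and the total size from a list of (filename, size) pairs. The function name compute_statistics and the keys file_count, total_size and extensions are hypothetical; the keys actually returned by get_statistics() are not documented here.

```python
import os
from collections import Counter


def compute_statistics(file_list):
    """Summarize a list of (filename, size) pairs, or return None if empty."""
    if not file_list:
        return None
    # Count files per lower-cased extension, e.g. ".png".
    extension_counts = Counter(
        os.path.splitext(name)[1].lower() for name, _ in file_list
    )
    return {
        "file_count": len(file_list),
        "total_size": sum(size for _, size in file_list),
        "extensions": dict(extension_counts),
    }


stats = compute_statistics([("a.png", 10), ("b.PNG", 5), ("c.txt", 1)])
```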

handle_fetch_file_list(force=False)

Called when the file list shall be pre-fetched.

If your custom FileSource is able to do so, populate self.file_list with a sorted list of all available files and, instead of iterating the files live, always access the matching file list entry using self.file_list[file_index].

Parameters

force (bool) – Enforce an update of the file list, even if it was created before already

handle_file_list_filter(filename)

Verifies if the file is valid and shall be processed by comparing it to the file mask, the index_filter etc.

Increases the file_index upon failure. Does NOT increase it upon success (as provide_result() will do so).

Parameters

filename (str) – The file’s name

Return type

bool

Returns

True if the file shall be processed, False otherwise.

handle_get_next_filename(iterator)

Returns the filename and the file size of the next file to be processed.

Override this method for your own, custom file iterator.

Parameters

iterator (FileSourceIterator) – The file iterator object

Return type

tuple[str, int] | None

Returns

Name and size of the next element as tuple

handle_next(iterator)

Returns the next available element

Parameters

iterator (FileSourceIterator) – The iterator object which keeps track of the current processing

Return type

FileSourceElement | None

Returns

The next file object if available

handle_provide_result(iterator, filename, data)

Provides the file result for the current iterator index

Parameters
  • iterator (FileSourceIterator) – The iterator handle

  • filename (str) – The name of the file to be stored

  • data (bytes) – The file data

Return type

FileSourceElement

handle_skip_check(file_info)

Verifies if the file is valid and shall be processed by comparing it to the file mask, the index_filter etc.

Increases the file_index upon failure. Does NOT increase it upon success (as provide_result() will do so).

Parameters

file_info (FileIterationData) – Information about the current file

Return type

str | None

Returns

A valid filename if the file shall be processed, None otherwise.

load_file_list(source, version=-1)

Tries to load the file list from file

Parameters
  • source (bytes | str) – The file list source. Any FileStag compatible data source.

  • version (int) –

    The user defined version number. It can be passed to enforce updating the list whenever this number is changed.

    If -1 is passed the version is ignored.

Return type

bool

Returns

True if a valid list could be loaded.

read_file(filename)

Reads a file from this file source, identified by name.

Note: Not all FileSources support direct file access by name, so you should always prefer to just iterate through a FileSource object rather than accessing single files if your FileSource can be freely configured. For example, an ImageFileSource pointing to a camera can only provide its data frame by frame, and thus image file by image file, and not by name.

Parameters

filename (str) – The name of the file to read

Return type

bytes | None

Returns

The file’s content on success, None otherwise
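The "content on success, None otherwise" contract of read_file, together with the exists check, can be sketched with the standard zipfile module as follows. This is an illustration under assumptions, not the library's code; read_member and member_exists are hypothetical helper names.

```python
import io
import zipfile


def read_member(archive: zipfile.ZipFile, filename: str):
    """Return the member's bytes, or None if it is not in the archive."""
    try:
        return archive.read(filename)
    except KeyError:
        # zipfile raises KeyError for unknown member names.
        return None


def member_exists(archive: zipfile.ZipFile, filename: str) -> bool:
    """Check whether a member with the given name is in the archive."""
    return filename in archive.namelist()


# Build a tiny archive in memory so the sketch needs no files on disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as writer:
    writer.writestr("docs/readme.txt", b"hello")

archive = zipfile.ZipFile(io.BytesIO(buffer.getvalue()))
found = read_member(archive, "docs/readme.txt")
missing = read_member(archive, "docs/unknown.txt")
has_readme = member_exists(archive, "docs/readme.txt")
has_unknown = member_exists(archive, "docs/unknown.txt")
archive.close()
```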

reduce_file_list()

Reduces the file_list by applying all filters (index_filter, max_file_count, filter_callback) in advance. Requires that the source was initialized with fetch_file_list and thus a non-streaming file source where the full file list is known in advance.

This way you know in advance which files (after all the filters) will really be processed with your current filter settings. So that the filters are not applied twice, this function also disables all callbacks and filter variables after its execution.

Return type

list[FileListEntry] | None

Returns

Returns the reduced file list
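The idea of applying all filters to a known file list up front can be sketched as below. The function reduce_names is hypothetical, and the order used here (filter callback, then maximum count, then index filter) is an assumption; the documentation does not state the order the library applies.

```python
def reduce_names(filenames, filter_callback=None, max_file_count=-1,
                 index_filter=None):
    """Return the filenames that remain after all filters were applied."""
    result = list(filenames)
    if filter_callback is not None:
        # Keep only the files the callback accepts.
        result = [name for name in result if filter_callback(name)]
    if max_file_count >= 0:
        # Assumed: the limit is applied before the index filter.
        result = result[:max_file_count]
    if index_filter is not None:
        worker_count, worker_index = index_filter
        # Every worker_count-th file, starting at worker_index.
        result = result[worker_index::worker_count]
    return result


names = ["a.png", "b.txt", "c.png", "d.png", "e.png"]
reduced = reduce_names(
    names,
    filter_callback=lambda name: name.endswith(".png"),
    max_file_count=3,
    index_filter=(2, 0),
)
```

Once the list is reduced this way, iterating it needs no further filtering, which matches the note that the filters are disabled afterwards.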

save_file_list(target, version=-1)

Saves the file list to a file so it can be quickly restored after a restart of the application.

Parameters
  • target (str) – The FileStag compatible file target, e.g. a local file name

  • version (int) –

    The user defined version number. It can be passed to enforce updating the list whenever this number is changed.

    If -1 is passed the version is ignored.
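The version handling shared by encode_file_list, load_file_list and save_file_list can be pictured with a small sketch: the version is stored next to the list, and a stored list is rejected when the requested version differs, unless -1 is passed. The JSON layout and the names encode_list and decode_list are assumptions made for illustration; the library's actual storage format is not described here.

```python
import json


def encode_list(entries, version=-1) -> bytes:
    """Encode the file list together with its version number."""
    payload = {"version": version, "files": entries}
    return json.dumps(payload).encode("utf-8")


def decode_list(data: bytes, version=-1):
    """Return the stored list, or None if the stored version mismatches.

    Passing -1 ignores the version check.
    """
    payload = json.loads(data.decode("utf-8"))
    if version != -1 and payload["version"] != version:
        return None
    return payload["files"]


encoded = encode_list([["a.png", 10], ["b.png", 20]], version=2)
same_version = decode_list(encoded, version=2)
other_version = decode_list(encoded, version=3)
ignored_version = decode_list(encoded)
```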

set_file_list(new_list)

Sets a custom file list provided by the user.

Helpful for large jobs where the total file list is split into several working packages in advance and the shares need to be customized.

Parameters

new_list (list[str] | list[FileListEntry]) – The new list to be assigned. Either a list of “FileListEntry”s with all details or a list of filenames

update_file_list(new_list)

Call this function if you want to manually update the file list.

Updates the internal search index and other helper variables.

Parameters

new_list (list[FileListEntry]) – The new list

_file_list: FileList | None
A sorted list of all files (if available, e.g. by setting fetch_file_list=True).

Note that settings such as index_filter and max_file_count have no effect on the file_list by default. You can, however, explicitly call the method reduce_file_list(), which executes all filters in advance to provide the final file_list and disables these variables afterwards.

_file_list_name

The name of the file from which the file list shall be loaded

_file_list_version: int

The version of the file list to assume. If it does not match the stored version, the stored list will be replaced

_statistics: dict | None

The statistics, only available when all files were iterated

access_lock

Multithreading access lock
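A lock like this is typically held while several threads read from one shared archive handle. The sketch below shows that pattern with the standard library only. That access_lock guards the zip archive in exactly this way, and that it is a re-entrant lock, are assumptions; locked_read is a hypothetical helper.

```python
import io
import threading
import zipfile
from concurrent.futures import ThreadPoolExecutor

# Build a small archive in memory.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as writer:
    for i in range(8):
        writer.writestr(f"file_{i}.txt", f"content {i}".encode())

shared_archive = zipfile.ZipFile(io.BytesIO(buffer.getvalue()))
access_lock = threading.RLock()


def locked_read(name: str) -> bytes:
    """Read one member while holding the lock, so reads do not interleave."""
    with access_lock:
        return shared_archive.read(name)


names = shared_archive.namelist()
with ThreadPoolExecutor(max_workers=4) as pool:
    # pool.map returns the results in the order of names.
    contents = list(pool.map(locked_read, names))
shared_archive.close()
```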

dont_load

If set to true, the iterator (for element in FileSource) will not fetch the file’s content but just iterate through its filenames

file_set

A set containing all known files. Only valid if file_list is available too

filter_callback

The filter function which will be called for each file to verify if it shall be processed

index_filter: tuple[int, int] | None

The index filter helps split a processing task across multiple threads, nodes and/or processes.

See initializer parameter.
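One way to picture the (worker_count, worker_index) tuple is a round-robin split of the file list, sketched below. The modulo assignment is an assumption for illustration; how FileStag assigns files to workers internally is not stated here, and select_for_worker is a hypothetical helper.

```python
from concurrent.futures import ThreadPoolExecutor


def select_for_worker(filenames, index_filter):
    """Return the share of filenames belonging to one worker."""
    worker_count, worker_index = index_filter
    return [
        name
        for index, name in enumerate(filenames)
        if index % worker_count == worker_index
    ]


filenames = [f"file_{i:02d}.png" for i in range(10)]

# Four workers receive (4, 0), (4, 1), (4, 2) and (4, 3).
with ThreadPoolExecutor(max_workers=4) as pool:
    shares = list(
        pool.map(lambda i: select_for_worker(filenames, (4, i)), range(4))
    )
```

Each file ends up in exactly one share, so the workers never process the same file twice. This only holds if every worker sees the files in the same order, which is why fetching the file list in advance is recommended for sources without a guaranteed order.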

is_closed

Defines if this file source was closed

max_file_count

The maximum number of files to process (excluding the impact of index_filter)

output_filename_list: list[str] | None

If defined it provides the output filenames for every file in self.file_list.

recursive

Defines if the search shall be executed recursive

search_mask

The search mask to match the filenames against before they are returned

search_path

The path to search within, e.g. a file path

source_data: bytes | None

The source data stream

source_filename

The name of the source file

source_identifier

The unique identifier

user_data

The user data for further customization, e.g. of the filter callback

zip_archive: zipfile.ZipFile

The zip archive which provides the file data