records_mover.records package

Submodules

records_mover.records.base_records_format module

class records_mover.records.base_records_format.BaseRecordsFormat

Bases: object

This represents the information needed to be able to parse a set of records data files.

See the records format specification for more detail.

To create an instance, see ParquetRecordsFormat or DelimitedRecordsFormat.

Module contents

class records_mover.records.DelimitedRecordsFormat(variant='bluelabs', hints={}, processing_instructions=<records_mover.records.processing_instructions.ProcessingInstructions object>)

Bases: BaseRecordsFormat

Describes records data files in delimited (CSV) format.

Parameters

variant (str) –
hints (Mapping[str, object]) –
processing_instructions (ProcessingInstructions) –

__init__(variant='bluelabs', hints={}, processing_instructions=<records_mover.records.processing_instructions.ProcessingInstructions object>)

See the records format documentation for full details on parameters.

Parameters

variant (str) – For a given format type (especially delimited), describes the subtype of the format. For ‘delimited’, valid values include ‘dumb’, ‘csv’, ‘bluelabs’, ‘bigquery’ and ‘vertica’.
hints (PartialRecordsHints) – Dictionary of names of delimited hints mapping to their values. See the records format specification for hints and valid values.
processing_instructions (records_mover.records.ProcessingInstructions) – Directives on how to handle different situations when processing files.

Return type

None

class records_mover.records.ParquetRecordsFormat

Bases: BaseRecordsFormat

Describes records files in Parquet format.

__init__()

Create a new instance of ParquetRecordsFormat.

Return type: None

class records_mover.records.ProcessingInstructions(fail_if_dont_understand=True, fail_if_cant_handle_hint=True, fail_if_row_invalid=True, max_inference_rows=1000000, max_failure_rows=None)

Bases: object

Parameters

fail_if_dont_understand (bool) –
fail_if_cant_handle_hint (bool) –
fail_if_row_invalid (bool) –
max_inference_rows (Optional[int]) –
max_failure_rows (Optional[int]) –

__init__(fail_if_dont_understand=True, fail_if_cant_handle_hint=True, fail_if_row_invalid=True, max_inference_rows=1000000, max_failure_rows=None)

Directives on how to handle different situations when processing records. Note that not all vendor mechanisms support this level of configurability; when choosing between optimizing for fast transfer and ability to comply, Records Mover will favor fast transfer.

Parameters

fail_if_dont_understand (bool) – If True, and a part of the RecordsFormat is not understood while processing, fail immediately and raise an exception. Otherwise, ignore the misunderstood instruction (e.g., ignore the hint or assume the default variant).
fail_if_cant_handle_hint (bool) – If True, and a certain hint can’t be handled as specified for whatever reason (e.g., limited options in the library, tool or database being used), raise an exception. Otherwise, ignore the hint and fall back to implementation-specific behavior.
fail_if_row_invalid (bool) – If True, and a particular row of data in the records file cannot be understood by the library, raise an exception. Otherwise, skip the row and continue loading the remaining rows.
max_failure_rows (Optional[int]) – Sets a tolerance for the number of rows of data in the records file that the library cannot understand and should ignore. Once this level is reached, raise an exception.
max_inference_rows (Optional[int]) – If the schema is not provided and we need it (e.g., we’re to load the records into a database and there’s no existing table), we’ll figure it out through ‘type inference’: looking at a number of examples of the data and building a specific schema that can load those rows. This can take some time, so this parameter controls the maximum number of rows we’ll look at. Higher values are more likely to result in a schema that the data can be loaded into, but will take longer. If set to None, the entire file will be processed.

Return type

None
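The trade-off behind max_inference_rows can be illustrated with a minimal, standard-library sketch. This is not Records Mover's inference code: the function infer_column_type and its three type names are hypothetical, and only the meaning of the row limit (including None for "the entire file") is taken from the description above.

```python
from itertools import islice
from typing import Iterable, Optional


def infer_column_type(values: Iterable[str],
                      max_inference_rows: Optional[int] = 1_000_000) -> str:
    """Hypothetical helper: guess a column type from a limited sample."""
    # islice(values, None) consumes the whole iterable, mirroring the
    # documented meaning of max_inference_rows=None.
    sample = list(islice(values, max_inference_rows))
    if not sample:
        return 'string'

    def all_parse(convert) -> bool:
        # True only if every sampled value converts without error.
        try:
            for value in sample:
                convert(value)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return 'integer'
    if all_parse(float):
        return 'float'
    return 'string'


column = ['1', '2', '3', 'not a number']
# Only the first three values are sampled, so the guess is too narrow.
print(infer_column_type(column, max_inference_rows=3))
# Every value is sampled, so the guess can hold the last row as well.
print(infer_column_type(column, max_inference_rows=None))
```

With the small limit the sketch settles on an integer column that the fourth row could not be loaded into; sampling everything avoids that at the cost of reading more data.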

records_mover.records.DelimitedVariant

Valid string values for the variant of a delimited records format. Variants specify a default set of parsing hints for how the delimited file is formatted. See the records format specification for semantics of each.

alias of Literal[dumb, csv, bigquery, bluelabs, vertica]
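Because the alias is a Literal type, the set of valid names can be read back at runtime. The sketch below uses a local stand-in for the alias so that it runs without the package installed; check_variant is a hypothetical helper, not part of Records Mover.

```python
from typing import Literal, get_args

# Local stand-in mirroring the alias documented above; real code would
# import DelimitedVariant from records_mover.records instead.
DelimitedVariant = Literal['dumb', 'csv', 'bigquery', 'bluelabs', 'vertica']


def check_variant(name: str) -> str:
    """Return name unchanged if it is one of the documented variants."""
    valid = get_args(DelimitedVariant)
    if name not in valid:
        raise ValueError(f"unknown variant {name!r}; expected one of {valid}")
    return name


print(check_variant('bluelabs'))
```

A check like this can be useful when the variant name comes from user input or configuration rather than from a string literal that a type checker can verify.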

class records_mover.records.AvroRecordsFormat

Bases: BaseRecordsFormat

Describes records files in Avro format.

format_type: typing_extensions.Literal[avro, delimited, parquet]

generate_filename(basename)

Parameters: basename (str) –
Return type: str

class records_mover.records.ExistingTableHandling(value)

Bases: Enum

Specifies behavior when an existing table with the same name is found when loading data into a database.

DELETE_AND_OVERWRITE = 1: Delete data transactionally (typically with a SQL DELETE statement) and then add new data to the existing table. The delete and the load will be done in a single transaction if the database allows for that, but note that some do not, especially while using the most efficient load method.

TRUNCATE_AND_OVERWRITE = 2: Remove data from the existing table without regard for transactions, data integrity constraints, triggers or redo logs, typically using a SQL TRUNCATE statement. The specific method depends on the database type. This is typically the fastest way to clear a table, but please read your database documentation first and understand the consequences.

DROP_AND_RECREATE = 3: Remove the target table entirely, typically with a SQL DROP TABLE command.

APPEND = 4: Leave the target table and current data in place, and add data to the table. Note that Records Mover uses the most efficient method of loading data into the table, which may not honor triggers and integrity constraints.
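As a rough illustration of the four modes, the sketch below pairs each one with the kind of SQL statement the descriptions above say is typical. The enum is a local stand-in with the same member names and values, and preparation_sql is a hypothetical helper; the statements Records Mover actually issues depend on the database and are not shown here.

```python
from enum import Enum
from typing import Optional


class ExistingTableHandling(Enum):
    # Local stand-in with the member names and values listed above.
    DELETE_AND_OVERWRITE = 1
    TRUNCATE_AND_OVERWRITE = 2
    DROP_AND_RECREATE = 3
    APPEND = 4


def preparation_sql(mode: ExistingTableHandling, table: str) -> Optional[str]:
    """Hypothetical helper: the statement typically run before loading."""
    if mode is ExistingTableHandling.DELETE_AND_OVERWRITE:
        return f"DELETE FROM {table}"
    if mode is ExistingTableHandling.TRUNCATE_AND_OVERWRITE:
        return f"TRUNCATE TABLE {table}"
    if mode is ExistingTableHandling.DROP_AND_RECREATE:
        return f"DROP TABLE {table}"
    # APPEND leaves the table and its current data in place.
    return None


for mode in ExistingTableHandling:
    print(mode.name, '->', preparation_sql(mode, 'myschema.mytable'))
```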

exception records_mover.records.RecordsFolderNonEmptyException

Bases: RecordsException

Raised if trying to write records to a non-empty target directory.

exception records_mover.records.RecordsException

Bases: Exception

Base class for all records system exceptions.

class records_mover.records.Records(db_driver=PleaseInfer.token, url_resolver=PleaseInfer.token, session=PleaseInfer.token)

Bases: object

To move records from one place to another, you can use the methods on this object.

This object should be pulled from the ‘records’ property on a records_mover.Session object instead of being constructed directly.

To move data, you can call the records_mover.records.move() method, which is aliased for your convenience on this object.

Example:

records = session.records
db_engine = session.get_default_db_engine()
url = 's3://some-bucket/some-directory/'
source = records.sources.directory_from_url(url=url)
target = records.targets.table(schema_name='myschema',
                               table_name='mytable',
                               db_engine=db_engine)
results = records.move(source, target)

Parameters

db_driver (Union[Callable[[Optional[Union[Engine, Connection]], Optional[Connection], Optional[Engine]], DBDriver], PleaseInfer]) –
url_resolver (Union[UrlResolver, PleaseInfer]) –
session (Union[Session, PleaseInfer]) –

move: Callable: Alias of records_mover.records.move()

sources: RecordsSources: Object containing factory methods to create various sources from which to copy records, of type records_mover.records.sources.RecordsSources

targets: RecordsTargets: Object containing factory methods to create various targets to which records can be copied, of type records_mover.records.targets.RecordsTargets

records_mover.records.move(records_source, records_target, processing_instructions=<records_mover.records.processing_instructions.ProcessingInstructions object>)

Copy records from one location to another. Applies a sequence of possible techniques to do this in an efficient way and respects the preferences set in records_source, records_target and processing_instructions.

Example use:

records = session.records
db_engine = session.get_default_db_engine()
url = 's3://some-bucket/some-directory/'
source = records.sources.directory_from_url(url=url)
target = records.targets.table(schema_name='myschema',
                               table_name='mytable',
                               db_engine=db_engine)
results = records.move(source, target)

Parameters

records_source (RecordsSource) – object returned by a factory method in records_mover.records.sources.RecordsSources which represents the place we’re copying records from.
records_target (RecordsTarget) – object returned by a factory method in records_mover.records.targets.RecordsTargets which represents the place we’re copying records to.
processing_instructions (records_mover.records.ProcessingInstructions) – Directives on how to handle different situations when processing files.

Return type

records_mover.records.MoveResult

class records_mover.records.MoveResult(move_count, output_urls)

Bases: NamedTuple

Represents the result of a move() operation between a records source and a records target.

Note that move_count and output_urls may be empty depending on the nature of the move - e.g., a move to a database doesn’t currently map to a formal URL, and whole-file based moves do not currently count the number of records being moved.

Parameters

move_count (Optional[int]) –
output_urls (Optional[Mapping[str, str]]) –

move_count: Optional[int]: Number of rows moved

output_urls: Optional[Mapping[str, str]]: A dictionary of short string aliases mapping to URLs of the resulting data
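Since both fields are optional, code that consumes a result should handle missing values. The sketch below uses a local NamedTuple with the two documented fields so it runs on its own; summarize and the 'directory' alias are hypothetical and chosen only for illustration.

```python
from typing import Mapping, NamedTuple, Optional


class MoveResult(NamedTuple):
    # Local stand-in with the two documented fields.
    move_count: Optional[int]
    output_urls: Optional[Mapping[str, str]]


def summarize(result: MoveResult) -> str:
    """Hypothetical helper: describe a result without assuming any field is set."""
    count = 'unknown' if result.move_count is None else str(result.move_count)
    urls = result.output_urls or {}
    locations = ', '.join(f"{alias}={url}" for alias, url in sorted(urls.items()))
    return f"rows moved: {count}; outputs: {locations or 'none reported'}"


# A move into a database table may report neither a count nor a URL.
print(summarize(MoveResult(move_count=None, output_urls=None)))
# The alias 'directory' is made up for this example.
print(summarize(MoveResult(42, {'directory': 's3://some-bucket/some-directory/'})))
```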
