records_mover.records.sources package

Module contents

class records_mover.records.sources.RecordsSources(db_driver, url_resolver)

Bases: object

These methods produce objects representing the source of a records move. The objects can be used as the ‘source’ argument to records_mover.records.move().

This object should be pulled from the ‘sources’ property of the ‘records’ property on a records_mover.Session object instead of being constructed directly.

Example use:

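from records_mover import Session

# Credentials and other settings are picked up by the Session itself.
session = Session()
records = session.records
db_engine = session.get_default_db_engine()
# A records directory URL must end with a '/'.
url = 's3://some-bucket/some-directory/'
source = records.sources.directory_from_url(url=url)
target = records.targets.table(schema_name='myschema',
                               table_name='mytable',
                               db_engine=db_engine)
results = records.move(source, target)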

Parameters

db_driver (Callable[[Optional[Union[Engine, Connection]], Optional[Connection], Optional[Engine]], DBDriver]) –
url_resolver (UrlResolver) –

dataframe(df, processing_instructions=<records_mover.records.processing_instructions.ProcessingInstructions object>, records_schema=None, include_index=False)

Represents a single dataframe source.

Parameters

df (DataFrame) – Pandas DataFrame to move data from.
processing_instructions (ProcessingInstructions) – Instructions used during creation of the schema SQL as a records_mover.records.ProcessingInstructions object.
include_index (bool) – If True, the Pandas DataFrame index column will be included in the move as a column; if False, it will be disregarded.
records_schema (Optional[RecordsSchema]) – Experimental interface; do not use.

Return type

DataframesRecordsSource

dataframes(dfs, processing_instructions=<records_mover.records.processing_instructions.ProcessingInstructions object>, records_schema=None, include_index=False)

Represents multiple dataframes as a source. Note that this accepts an iterable, meaning that the dataframes in question can be generated dynamically in chunks.

Parameters

dfs (Iterable[DataFrame]) – Iterable of Pandas DataFrames to move data from. All data from these DataFrames will be added to the same table.
processing_instructions (ProcessingInstructions) – Instructions used during creation of the schema SQL as a records_mover.records.ProcessingInstructions object.
include_index (bool) – If True, the Pandas DataFrame index column will be included in the move as a column; if False, it will be disregarded.
records_schema (Optional[RecordsSchema]) – Experimental interface; do not use.

Return type

DataframesRecordsSource
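
Because dfs may be any iterable, a generator can build each DataFrame only when it is consumed. The following is a minimal sketch of that idea, not part of the library: chunked, dataframe_chunks and chunked_source are hypothetical helper names, records is assumed to be session.records, and pandas is assumed to be installed.

```python
from itertools import islice
from typing import Any, Iterable, Iterator, List, Sequence


def chunked(rows: Iterable[Sequence[Any]],
            chunk_size: int) -> Iterator[List[Sequence[Any]]]:
    """Yield successive lists of at most chunk_size rows."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def dataframe_chunks(rows, columns, chunk_size=10_000):
    """Lazily turn an iterable of rows into a stream of small DataFrames."""
    import pandas as pd  # imported lazily; assumes pandas is installed

    for chunk in chunked(rows, chunk_size):
        yield pd.DataFrame(chunk, columns=columns)


def chunked_source(records, rows, columns, chunk_size=10_000):
    """Build a dataframes() source; records is assumed to be session.records."""
    dfs = dataframe_chunks(rows, columns, chunk_size)
    return records.sources.dataframes(dfs=dfs)
```

Only one chunk of rows is held as a DataFrame at a time, which is the property the note about dynamically generated chunks relies on.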

fileobjs(target_names_to_input_fileobjs, records_format=None, initial_hints=None, records_schema=None)

Represents one or more streams of data files as a source.

Parameters

target_names_to_input_fileobjs (Mapping[str, IO[bytes]]) – Mapping of filenames to streams of data files.
records_format (Optional[BaseRecordsFormat]) – Description of the format of the data files.
initial_hints (Optional[PartialRecordsHints]) – If records_format is not provided, the format of the file will be determined automatically. If that effort fails, you can help it out by providing hints in this dictionary as needed. See the records format specification for hints and valid values.
records_schema (Optional[RecordsSchema]) – Experimental interface; do not use.

Return type

Union[UninferredFileobjsRecordsSource, FileobjsSource]
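
As an illustration of the mapping this method expects, the sketch below serializes rows to an in-memory CSV stream and hands it over under a filename. The helper names rows_to_csv_stream and in_memory_source are hypothetical, and records is assumed to be session.records.

```python
import csv
import io
from typing import Iterable, Sequence


def rows_to_csv_stream(rows: Iterable[Sequence[object]]) -> io.BytesIO:
    """Serialize rows as UTF-8 CSV into an in-memory binary stream."""
    text = io.StringIO(newline="")
    csv.writer(text).writerows(rows)
    return io.BytesIO(text.getvalue().encode("utf-8"))


def in_memory_source(records, name, rows):
    """Build a fileobjs() source holding a single in-memory CSV stream.

    records is assumed to be session.records; name is the filename key
    under which the stream is presented.
    """
    stream = rows_to_csv_stream(rows)
    return records.sources.fileobjs(
        target_names_to_input_fileobjs={name: stream})
```

Since no records_format is passed here, format detection is left to the library, as described for initial_hints above.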

data_url(input_url, records_format=None, initial_hints=None, records_schema=None)

Represents a URL pointer to a data file as a source.

Parameters

input_url (str) – Location of the data file. Must be a URL format understood by the records_mover.url library.
records_format (Optional[BaseRecordsFormat]) – Description of the format of the data files.
initial_hints (Optional[PartialRecordsHints]) – If records_format is not provided, the format of the file will be determined automatically. If that effort fails, you can help it out by providing hints in this dictionary as needed. See the records format specification for hints and valid values.
records_schema (Optional[RecordsSchema]) – Experimental interface; do not use.

Return type

DataUrlRecordsSource

table(db_engine, schema_name, table_name, db_conn=None)

Represents a SQLAlchemy-accessible database table as a source.

Parameters

db_engine (Engine) – SQLAlchemy database engine to pull data from.
schema_name (str) – Schema name of a table to get data from.
table_name (str) – Table name of a table to get data from.
db_conn (Optional[Connection]) – SQLAlchemy database connection to use to pull data from.

Return type

TableRecordsSource
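
A table source pairs naturally with a table target. The sketch below copies one table to another within the same schema, reusing the targets.table() call from the class-level example; copy_table is a hypothetical helper name, and records and db_engine are assumed to come from an existing session.

```python
def copy_table(records, db_engine, schema_name, source_table, target_table):
    """Move all rows from source_table to target_table in one schema.

    records is assumed to be session.records and db_engine the engine
    returned by session.get_default_db_engine().
    """
    source = records.sources.table(db_engine=db_engine,
                                   schema_name=schema_name,
                                   table_name=source_table)
    target = records.targets.table(schema_name=schema_name,
                                   table_name=target_table,
                                   db_engine=db_engine)
    return records.move(source, target)
```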

directory_from_url(url, hints={}, fail_if_dont_understand=True)

Represents a Records Directory pointed to by a URL as a source.

Parameters

url (str) – Location of the records directory. Must be a URL format understood by the records_mover.url library, and must be a directory URL that ends with a ‘/’.
hints (PartialRecordsHints) – Any additional hints that should override the description of the data files already in the records directory.
fail_if_dont_understand (bool) – If True, and a part of the RecordsFormat is not understood while processing, then immediately fail and raise an exception. Otherwise, ignore the misunderstood instruction (e.g., ignore the hint, assume the default variant, and so on).

Return type

RecordsDirectoryRecordsSource
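
Since a records directory URL must end with a ‘/’, the trailing slash can be used to choose between directory_from_url() and data_url(). This is only a sketch of one possible convention: source_for_url is a hypothetical helper name, and records is assumed to be session.records.

```python
def source_for_url(records, url):
    """Pick a source factory from the shape of the URL.

    URLs ending in '/' are treated as records directories; anything
    else is treated as a single data file.
    """
    if url.endswith("/"):
        return records.sources.directory_from_url(url=url)
    return records.sources.data_url(input_url=url)
```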

local_file(filename, records_format=None, initial_hints=None, records_schema=None)

Represents a data file on the local filesystem as a source.

Parameters

filename (str) – File path (relative or absolute) of the data file to load.
records_format (Optional[BaseRecordsFormat]) – Description of the format of the data files.
initial_hints (Optional[PartialRecordsHints]) – If records_format is not provided, the format of the file will be determined automatically. If that effort fails, you can help it out by providing hints in this dictionary as needed. See the records format specification for hints and valid values.
records_schema (Optional[RecordsSchema]) – Experimental interface; do not use.

Return type

DataUrlRecordsSource
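
When automatic format detection needs help, initial_hints can carry the details you already know. The sketch below describes a tab-separated file with a header row. The hint names 'field-delimiter' and 'header-row' are assumptions here and should be checked against the records format specification; tab_separated_source is a hypothetical helper name, and records is assumed to be session.records.

```python
def tab_separated_source(records, filename):
    """Build a local_file() source for a tab-separated file with a header.

    The hint keys below are assumed from the records format specification;
    verify them there before relying on this.
    """
    initial_hints = {
        "field-delimiter": "\t",  # assumed hint name
        "header-row": True,       # assumed hint name
    }
    return records.sources.local_file(filename=filename,
                                      initial_hints=initial_hints)
```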

google_sheet(spreadsheet_id, sheet_name_or_range, google_cloud_creds, out_of_band_column_headers=None, header_translator=None)

Represents a sheet or range in a Google Sheets spreadsheet as a source, via the Google Sheets API.

Parameters

spreadsheet_id (str) – This is the xyz in https://docs.google.com/spreadsheets/d/xyz/edit?ts=5be5b383#gid=abc
sheet_name_or_range (str) – This is the label of the particular tab within the Google Sheets spreadsheet that the data should be read from, or a valid Google Sheets-style range formula.
google_cloud_creds (google.auth.credentials.Credentials) – This is an object representing Google Cloud Platform access credentials.
out_of_band_column_headers (Optional[Iterable[str]]) – If provided, we’ll use these column names instead of the first row of the spreadsheet. If set, the first row will be treated as data.
header_translator (Optional[Callable[[str], str]]) – If provided, header names pulled from the sheet will be translated through this function. Not used if out_of_band_column_headers is set.

Return type

GoogleSheetsRecordsSource
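
The sketch below shows one way to derive spreadsheet_id from a sheet URL of the form given above and to supply a header_translator. The helper names spreadsheet_id_from_url, snake_case_header and sheet_source are hypothetical, records is assumed to be session.records, and google_cloud_creds is assumed to be a credentials object obtained elsewhere.

```python
import re
from urllib.parse import urlparse


def spreadsheet_id_from_url(url: str) -> str:
    """Return the path segment after /spreadsheets/d/ in a Google Sheets URL."""
    match = re.search(r"/spreadsheets/d/([^/]+)", urlparse(url).path)
    if match is None:
        raise ValueError(f"not a Google Sheets URL: {url!r}")
    return match.group(1)


def snake_case_header(header: str) -> str:
    """Normalize a sheet header such as 'First Name ' to 'first_name'."""
    return re.sub(r"[^0-9a-z]+", "_", header.strip().lower()).strip("_")


def sheet_source(records, sheet_url, sheet_name, google_cloud_creds):
    """Build a google_sheet() source with normalized column headers."""
    return records.sources.google_sheet(
        spreadsheet_id=spreadsheet_id_from_url(sheet_url),
        sheet_name_or_range=sheet_name,
        google_cloud_creds=google_cloud_creds,
        header_translator=snake_case_header,
    )
```

The translator is applied only to headers pulled from the sheet; per the parameter description, it is not used when out_of_band_column_headers is set.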