records_mover.records.sources package

Module contents

class records_mover.records.sources.RecordsSources(db_driver, url_resolver)

Bases: object

These methods produce objects representing the source of a records move. The objects can be used as the ‘source’ argument to records_mover.records.move().

Obtain this object from the ‘sources’ property of the ‘records’ property on a records_mover.Session object rather than constructing it directly.

Example use:

from records_mover import Session

# Sources and targets are reached through a Session rather than built directly.
session = Session()
records = session.records
db_engine = session.get_default_db_engine()
url = 's3://some-bucket/some-directory/'
source = records.sources.directory_from_url(url=url)
target = records.targets.table(schema_name='myschema',
                               table_name='mytable',
                               db_engine=db_engine)
results = records.move(source, target)

Parameters
  • db_driver (Callable[[Optional[Union[Engine, Connection]], Optional[Connection], Optional[Engine]], DBDriver]) –

  • url_resolver (UrlResolver) –

dataframe(df, processing_instructions=<records_mover.records.processing_instructions.ProcessingInstructions object>, records_schema=None, include_index=False)

Represents a single dataframe source.

Parameters
  • df (DataFrame) – Pandas DataFrame to move data from.

  • processing_instructions (ProcessingInstructions) – Instructions used during creation of the schema SQL as a records_mover.records.ProcessingInstructions object.

  • include_index (bool) – If True, the Pandas DataFrame index column will be included in the move as a column; if False, it will be disregarded.

  • records_schema (Optional[RecordsSchema]) – Experimental interface; do not use.

Return type

DataframesRecordsSource

dataframes(dfs, processing_instructions=<records_mover.records.processing_instructions.ProcessingInstructions object>, records_schema=None, include_index=False)

Represents multiple dataframes as a source. Note that this accepts an iterable, meaning that the dataframes in question can be generated dynamically in chunks.

Parameters
  • dfs (Iterable[DataFrame]) – Iterable of Pandas DataFrames to move data from – all data from these DataFrames will be added to the same table.

  • processing_instructions (ProcessingInstructions) – Instructions used during creation of the schema SQL as a records_mover.records.ProcessingInstructions object.

  • include_index (bool) – If True, the Pandas DataFrame index column will be included in the move as a column; if False, it will be disregarded.

  • records_schema (Optional[RecordsSchema]) – Experimental interface; do not use.

Return type

DataframesRecordsSource
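
Because dfs is an iterable, the DataFrames can be produced lazily by a generator instead of being held in memory all at once. The following is a minimal sketch of that idea, assuming pandas is installed; chunk_rows and dataframe_chunks are hypothetical helper names, not part of records_mover, and the final call is shown only as a comment.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def chunk_rows(rows: Iterable[Dict[str, Any]],
               size: int) -> Iterator[List[Dict[str, Any]]]:
    """Yield successive lists of at most `size` rows (hypothetical helper)."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def dataframe_chunks(rows: Iterable[Dict[str, Any]], size: int):
    """Lazily turn each chunk of rows into a pandas DataFrame."""
    import pandas as pd  # imported lazily so chunk_rows works without pandas
    for chunk in chunk_rows(rows, size):
        yield pd.DataFrame(chunk)


# Usage (not executed here; `records` would be session.records):
# source = records.sources.dataframes(dfs=dataframe_chunks(rows, 1000))
```

Each DataFrame is built only when the generator is advanced, so peak memory is bounded by the chunk size rather than by the full data set.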

fileobjs(target_names_to_input_fileobjs, records_format=None, initial_hints=None, records_schema=None)

Represents one or more streams of data files as a source.

Parameters
  • target_names_to_input_fileobjs (Mapping[str, IO[bytes]]) – Mapping of file names to binary streams of data files.

  • records_format (Optional[BaseRecordsFormat]) – Description of the format of the data files.

  • initial_hints (Optional[PartialRecordsHints]) – If records_format is not provided, the format of the file will be determined automatically. If that effort fails, you can help it out by providing hints in this dictionary as needed. See the records format specification for hints and valid values.

  • records_schema (Optional[RecordsSchema]) – Experimental interface; do not use.

Return type

Union[UninferredFileobjsRecordsSource, FileobjsSource]
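
The mapping expected by target_names_to_input_fileobjs can be built from any binary streams. A minimal sketch using in-memory io.BytesIO objects follows; build_fileobj_mapping, the file names, and the sample CSV bytes are illustrative assumptions, and the fileobjs() call itself is shown only as a comment.

```python
import io
from typing import IO, Dict, Mapping


def build_fileobj_mapping(files: Mapping[str, bytes]) -> Dict[str, IO[bytes]]:
    """Wrap raw bytes in binary streams, keyed by file name (hypothetical helper)."""
    return {name: io.BytesIO(data) for name, data in files.items()}


# Illustrative sample data only.
mapping = build_fileobj_mapping({
    "part-1.csv": b"id,name\n1,alice\n",
    "part-2.csv": b"id,name\n2,bob\n",
})

# Usage (not executed here; `records` would be session.records):
# source = records.sources.fileobjs(target_names_to_input_fileobjs=mapping)
```

Files opened with open(path, 'rb') satisfy the same IO[bytes] type and could be used in place of the in-memory streams.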

data_url(input_url, records_format=None, initial_hints=None, records_schema=None)

Represents a URL pointer to a data file as a source.

Parameters
  • input_url (str) – Location of the data file. Must be a URL format understood by the records_mover.url library.

  • records_format (Optional[BaseRecordsFormat]) – Description of the format of the data files.

  • initial_hints (Optional[PartialRecordsHints]) – If records_format is not provided, the format of the file will be determined automatically. If that effort fails, you can help it out by providing hints in this dictionary as needed. See the records format specification for hints and valid values.

  • records_schema (Optional[RecordsSchema]) – Experimental interface; do not use.

Return type

DataUrlRecordsSource

table(db_engine, schema_name, table_name, db_conn=None)

Represents a SQLAlchemy-accessible database table as a source.

Parameters
  • db_engine (Engine) – SQLAlchemy database engine to pull data from.

  • schema_name (str) – Schema name of a table to get data from.

  • table_name (str) – Table name of a table to get data from.

  • db_conn (Optional[Connection]) – SQLAlchemy database connection to pull data from.

Return type

TableRecordsSource

directory_from_url(url, hints={}, fail_if_dont_understand=True)

Represents a Records Directory pointed to by a URL as a source.

Parameters
  • url (str) – Location of the records directory. Must be a URL format understood by the records_mover.url library, and must be a directory URL that ends with a ‘/’.

  • hints (PartialRecordsHints) – Any additional hints that should override the description of the data files already in the records directory.

  • fail_if_dont_understand (bool) – If True, and a part of the RecordsFormat is not understood while processing, then immediately fail and raise an exception. Otherwise, ignore the misunderstood instruction (e.g., ignore the hint, assume the default variant, etc.).

Return type

RecordsDirectoryRecordsSource
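
Since the url argument must end with a ‘/’, it can help to normalize a user-supplied location before passing it in. The sketch below shows one simple way to do that; as_directory_url is a hypothetical helper, and it assumes the URL has no query string or fragment after the path.

```python
def as_directory_url(url: str) -> str:
    """Return `url` with a trailing '/' appended if it is missing.

    Hypothetical helper; assumes no query string or fragment follows the path.
    """
    return url if url.endswith("/") else url + "/"


# Usage (not executed here; `records` would be session.records):
# source = records.sources.directory_from_url(
#     url=as_directory_url("s3://some-bucket/some-directory"))
```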

local_file(filename, records_format=None, initial_hints=None, records_schema=None)

Represents a data file on the local filesystem as a source.

Parameters
  • filename (str) – File path (relative or absolute) of the data file to load.

  • records_format (Optional[BaseRecordsFormat]) – Description of the format of the data files.

  • initial_hints (Optional[PartialRecordsHints]) – If records_format is not provided, the format of the file will be determined automatically. If that effort fails, you can help it out by providing hints in this dictionary as needed. See the records format specification for hints and valid values.

  • records_schema (Optional[RecordsSchema]) – Experimental interface; do not use.

Return type

DataUrlRecordsSource
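
When automatic format detection needs help, initial_hints can describe what is already known about the file. The sketch below writes a small CSV and prepares a hints dictionary for it. The hint names 'header-row' and 'field-delimiter' are assumed to match the records format specification and should be checked there; write_sample_csv is a hypothetical helper, and the local_file() call is shown only as a comment.

```python
import csv
import os
import tempfile


def write_sample_csv(directory: str) -> str:
    """Write a small CSV with a header row and return its absolute path."""
    path = os.path.join(directory, "sample.csv")
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "name"])
        writer.writerow([1, "alice"])
    return os.path.abspath(path)


# Hint names are assumptions; confirm them against the records format spec.
initial_hints = {"header-row": True, "field-delimiter": ","}

with tempfile.TemporaryDirectory() as tmpdir:
    csv_path = write_sample_csv(tmpdir)
    # Read the file back to confirm what a loader would see.
    with open(csv_path, newline="") as f:
        sample_rows = list(csv.reader(f))
    # Usage (not executed here; `records` would be session.records):
    # source = records.sources.local_file(filename=csv_path,
    #                                     initial_hints=initial_hints)
```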

google_sheet(spreadsheet_id, sheet_name_or_range, google_cloud_creds, out_of_band_column_headers=None, header_translator=None)

Represents a sheet or range in a Google Sheets spreadsheet as a source, via the Google Sheets API.

Parameters
  • spreadsheet_id (str) – This is the xyz in https://docs.google.com/spreadsheets/d/xyz/edit?ts=5be5b383#gid=abc

  • sheet_name_or_range (str) – This is the label of the particular tab within the Google Sheets spreadsheet that the data should be read from, or a valid Google Sheets-style range formula.

  • google_cloud_creds (google.auth.credentials.Credentials) – This is an object representing Google Cloud Platform access credentials.

  • out_of_band_column_headers (Optional[Iterable[str]]) – If provided, we’ll use these column names instead of the first row of the spreadsheet. If set, the first row will be treated as data.

  • header_translator (Optional[Callable[[str], str]]) – If provided, header names pulled from the sheet will be translated through this function. Not used if out_of_band_column_headers is set.

Return type

GoogleSheetsRecordsSource
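
Two of these arguments are easy to prepare with small helpers: spreadsheet_id can be pulled out of a sheet's URL, and header_translator is any function from str to str. The sketch below is one possible version of each; spreadsheet_id_from_url and snake_case_header are hypothetical names, and the google_sheet() call is shown only as a comment because it needs real credentials.

```python
import re
from typing import Optional


def spreadsheet_id_from_url(url: str) -> Optional[str]:
    """Extract the ID that follows '/spreadsheets/d/' in a Google Sheets URL."""
    match = re.search(r"/spreadsheets/d/([^/?#]+)", url)
    return match.group(1) if match else None


def snake_case_header(header: str) -> str:
    """Lower-case a header and replace runs of other characters with '_'."""
    return re.sub(r"[^0-9a-z]+", "_", header.strip().lower()).strip("_")


# Usage (not executed here; `records` would be session.records and
# `creds` a google.auth.credentials.Credentials object):
# source = records.sources.google_sheet(
#     spreadsheet_id=spreadsheet_id_from_url(sheet_url),
#     sheet_name_or_range="Sheet1",
#     google_cloud_creds=creds,
#     header_translator=snake_case_header)
```

As noted above, the translator has no effect when out_of_band_column_headers is set, since the header names then come from that argument rather than from the sheet.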