genome_kmers package
kmers
- class genome_kmers.kmers.Kmers(seq_coll: SequenceCollection | None = None, min_kmer_len: int = 1, max_kmer_len: int | None = None, source_strand: str = 'forward', track_strands_separately: bool = False, method: str = 'single_pass')[source]
Bases: object
Defines memory-efficient kmer calculations on a genome.
- generate_get_kmer_info_func(one_based_seq_index: bool) Callable[source]
Generate the get_kmer_info function that is used to get kmer information from a sequence byte array index.
- Parameters:
one_based_seq_index (bool) – whether to return one-based sequence indices
- Returns:
get_kmer_info
- Return type:
Callable
- get_is_less_than_func(validate_kmers: bool = True, break_ties: bool = False) Callable[source]
Returns a less than function that takes two integers as input and returns whether the kmer defined by the first index is less than the kmer defined by the second index.
NOTE: If break_ties is True, the function returns True when the first of two equal kmers has the smaller sba_start_index. This is useful for guaranteeing identical output between different runs. However, it comes at a significant performance cost due to the additional swapping required.
- Parameters:
validate_kmers (bool, optional) – Explicitly verify that both kmers are at least min_kmer_len long. Defaults to True.
break_ties (bool, optional) – if two kmers are lexicographically equivalent, break the tie using the sba_start_index. Defaults to False.
- Raises:
NotImplementedError – kmer_source_strand and strands_loaded must both be “forward”
- Returns:
is_less_than_func
- Return type:
Callable
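The effect of break_ties can be illustrated with a small pure-Python sketch. This is not the package's implementation: the helper name make_is_less_than, the use of bytes for the sequence byte array, and the way a kmer is cut at the "$" terminator are all assumptions made for illustration.

```python
def make_is_less_than(sba: bytes, max_kmer_len: int, break_ties: bool = False):
    """Return a less-than function over kmer start indices (hypothetical helper)."""

    def kmer_at(start: int) -> bytes:
        # Assumption: a kmer ends at the "$" terminator or after max_kmer_len bases.
        end = sba.find(b"$", start)
        if end == -1:
            end = len(sba)
        return sba[start:min(end, start + max_kmer_len)]

    def is_less_than(idx_a: int, idx_b: int) -> bool:
        kmer_a, kmer_b = kmer_at(idx_a), kmer_at(idx_b)
        if kmer_a != kmer_b:
            return kmer_a < kmer_b
        # Equal kmers: optionally order by start index so sorting is deterministic.
        return break_ties and idx_a < idx_b

    return is_less_than
```

With break_ties=False two equal kmers are never "less than" each other, so their relative order after sorting depends on the sort algorithm. With break_ties=True the order is fully determined by the start index.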
- get_kmer_count(kmer_len: int | None, kmer_filter_func: Callable = kmer_filter_keep_all, min_group_size: int = 1, max_group_size: int | None = None) int[source]
A customizable function to count the total number of kmers passing filters.
Examples:
Count kmers that occur exactly once:
    kmers.get_kmer_count(max_group_size=1)
Count kmers that are repeated at least 5 times and no more than 10 times:
    kmers.get_kmer_count(min_group_size=5, max_group_size=10)
Count 50-mers:
    filter = gen_kmer_length_filter_func(min_kmer_len=50)
    kmers.get_kmer_count(kmer_filter_func=filter)
- Parameters:
kmer_len (Union[int, None]) – length of kmer. If None, take the longest possible.
kmer_filter_func (Callable, optional) – function that returns true if a kmer passes all filters. Defaults to kmer_filter_keep_all.
min_group_size (int, optional) – Smallest allowed group size. Defaults to 1.
max_group_size (Union[int, None], optional) – Largest allowed group size. If None, then there is no maximum group size. Defaults to None.
- Raises:
NotImplementedError – if kmer_source_strand and seq_coll.strands() loaded are not both “forward”
ValueError – kmer_len is invalid
ValueError – one or more group params are invalid (min_group_size, max_group_size)
ValueError – kmer_comparison_func is provided when kmers have not been sorted
- Returns:
total_kmer_count
- Return type:
int
- get_kmer_group_counts(kmer_len: int | None, kmer_filter_func: Callable = kmer_filter_keep_all, min_group_size: int = 1, max_group_size: int | None = None, max_counts_bin: int = 1000000) tuple[array, int][source]
A customizable function to build a histogram of group sizes for kmers passing filters.
Examples:
Get histogram for 50-mers:
    filter = gen_kmer_length_filter_func(min_kmer_len=50)
    kmers.get_kmer_group_counts(kmer_filter_func=filter)
- Parameters:
kmer_len (Union[int, None]) – length of kmer. If None, take the longest possible.
kmer_filter_func (Callable, optional) – function that returns true if a kmer passes all filters. Defaults to kmer_filter_keep_all.
min_group_size (int, optional) – Smallest allowed group size. Defaults to 1.
max_group_size (Union[int, None], optional) – Largest allowed group size. If None, then there is no maximum group size. Defaults to None.
max_counts_bin (int, optional) – largest group_size bin in the return counts_by_group_size array. Group sizes that exceed this value will be placed in this bin. Defaults to 1000000.
- Raises:
NotImplementedError – if kmer_source_strand and seq_coll.strands() loaded are not both “forward”
ValueError – kmer_len is invalid
ValueError – one or more group params are invalid (min_group_size, max_group_size)
ValueError – kmer_comparison_func is provided when kmers have not been sorted
- Returns:
counts_by_group_size (np.array): histogram of group sizes. counts_by_group_size[i] gives the number of groups of size i. Any group sizes > max_counts_bin will be placed in max_counts_bin.
total_kmer_count (int): total number of kmers
- Return type:
tuple[np.array, int]
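The overflow behaviour of max_counts_bin can be sketched as follows. This is a simplified illustration on a plain sorted list of strings rather than the package's sorted start-index array; the helper name group_size_hist and the array length of max_counts_bin + 1 are assumptions.

```python
from itertools import groupby


def group_size_hist(sorted_kmers, max_counts_bin=1_000_000):
    """Histogram of group sizes for an already sorted list of kmers (sketch)."""
    # Assumption: index i holds the number of groups of size i, so the
    # histogram needs max_counts_bin + 1 entries.
    counts_by_group_size = [0] * (max_counts_bin + 1)
    total_kmer_count = 0
    for _, group in groupby(sorted_kmers):
        size = sum(1 for _ in group)
        # Group sizes above max_counts_bin are clamped into the last bin.
        counts_by_group_size[min(size, max_counts_bin)] += 1
        total_kmer_count += size
    return counts_by_group_size, total_kmer_count
```

For the kmers AC, AC, AC, CG, GT, GT with max_counts_bin=2, the group of three AC kmers lands in bin 2 together with the group of two GT kmers.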
- get_kmer_str(kmer_num: int, kmer_len: int | None = None) str[source]
Get the kmer_num-th kmer, of length kmer_len.
- Parameters:
kmer_num (int) – which number kmer to return (in range [0, num_kmers - 1])
kmer_len (Union[int, None], optional) – length of kmer to return. If kmer_len is None, return the longest possible kmer, which ends when the segment ends or max_kmer_len is reached. Defaults to None.
- Raises:
NotImplementedError – kmer_source_strand and strands_loaded must both be “forward”
ValueError – kmer_num is invalid
ValueError – kmer_len is invalid
- Returns:
kmer
- Return type:
str
- get_kmer_str_no_checks(kmer_num: int, kmer_strand: str, kmer_len: int) str[source]
Return a string representation of kmer_num on kmer_strand with kmer_len. No checks are done to verify the arguments provided. Only call this function if it is known that these checks have already been completed (e.g. when kmer_num was yielded by get_kmers()).
- Parameters:
kmer_num (int) – kmer number
kmer_strand (str) – “+” or “-”
kmer_len (int) – length of the kmer
- Raises:
NotImplementedError – kmer_strand != “+”
ValueError – unrecognized kmer_strand
- Returns:
kmer_str
- Return type:
str
- get_kmers(kmer_len: int | None, one_based_seq_index: bool = False, kmer_filter_func: Callable = kmer_filter_keep_all, kmer_info_to_yield: str = 'minimum', min_group_size: int = 1, max_group_size: int | None = None, yield_first_n: int | None = None) Generator[tuple, None, None][source]
A customizable generator yielding tuples with all kmer information.
Examples:
Yield all kmers:
    kmers.get_kmers(yield_first_n=None)
Yield only the first occurrence of a kmer:
    kmers.get_kmers(yield_first_n=1)
Yield up to the first 10 occurrences of a kmer:
    kmers.get_kmers(yield_first_n=10)
Yield all kmers that occur exactly once:
    kmers.get_kmers(max_group_size=1)
Yield all kmers that are repeated at least 5 times and no more than 10 times:
    kmers.get_kmers(min_group_size=5, max_group_size=10)
NOTE: group yielding is not supported if kmers are unsorted. The user cannot provide min_group_size, max_group_size, or yield_first_n in this situation.
- Parameters:
kmer_len (Union[int, None]) – length of kmer. If None, take the longest possible.
one_based_seq_index (bool, optional) – whether yielded sequence indices should be 1-based. Defaults to False.
kmer_filter_func (Callable, optional) – function that returns true if a kmer passes all filters. Defaults to kmer_filter_keep_all.
kmer_info_to_yield (str) – “minimum” or “full”. Defaults to “minimum”
min_group_size (int, optional) – Smallest allowed group size. Defaults to 1.
max_group_size (Union[int, None], optional) – Largest allowed group size. If None, then there is no maximum group size. Defaults to None.
yield_first_n (Union[int, None], optional) – yield up to this many kmer_nums. Defaults to None.
- Raises:
NotImplementedError – if kmer_source_strand and seq_coll.strands() loaded are not both “forward”
ValueError – kmer_len is invalid
ValueError – one or more group params are invalid (min_group_size, max_group_size, yield_first_n)
ValueError – kmer_comparison_func is provided when kmers have not been sorted
- Yields:
Generator[tuple, None, None] – output depends on get_kmer_info_func
- load(load_file_path: Path, seq_coll: SequenceCollection | None = None, format: str = 'hdf5') None[source]
Load Kmers object from saved file.
- Parameters:
load_file_path (Path) – path to file to load.
seq_coll (Union[SequenceCollection, None], optional) – If provided, seq_coll will be loaded into the kmers object rather than attempting to load from the saved file. Defaults to None.
format (str, optional) – “hdf5” or “shelve”. Defaults to “hdf5”.
- Raises:
ValueError – format not recognized
- save(save_file_path: Path, include_sequence_collection: bool = False, format: str = 'hdf5', mode: str = 'w') None[source]
Save Kmers object to file.
- Parameters:
save_file_path (Path) – path for saved file
include_sequence_collection (bool, optional) – whether to include sequence collection. Defaults to False.
format (str, optional) – “hdf5” or “shelve”. Defaults to “hdf5”.
mode (str, optional) – mode with which to open file and save information. “w” for write or “a” for append. Defaults to “w”.
- Raises:
ValueError – format not recognized
- genome_kmers.kmers.compare_sba_kmers_always_less_than(sba_a: array, sba_b: array, kmer_sba_start_idx_a: int, kmer_sba_start_idx_b: int, max_kmer_len: int | None = None) tuple[int, int][source]
- genome_kmers.kmers.compare_sba_kmers_lexicographically(sba_a: array, sba_b: array, kmer_sba_start_idx_a: int, kmer_sba_start_idx_b: int, max_kmer_len: int | None = None) tuple[int, int][source]
Lexicographically compare two kmers of up to max_kmer_len bases. If max_kmer_len is None, the end of the segment defines the longest kmer.
NOTE: This function does no validation of max_kmer_len. It will compare up to max_kmer_len bases if required, but it returns as soon as the comparison result is known.
- Parameters:
sba_a (np.array) – sequence byte array a
sba_b (np.array) – sequence byte array b
kmer_sba_start_idx_a (int) – index in sba that defines the start of kmer a
kmer_sba_start_idx_b (int) – index in sba that defines the start of kmer b
max_kmer_len (Union[int, None], optional) – Maximum length of the kmers to compare. If None, the end of the segment defines the longest kmers to compare. Defaults to None.
- Raises:
AssertionError – there were no valid bases to compare
- Returns:
comparison, last_kmer_index_compared
comparison: +1 if kmer_a > kmer_b, 0 if kmer_a == kmer_b, -1 if kmer_a < kmer_b
last_kmer_index_compared: the kmer index of the last valid comparison done between two bases. If a single base was compared, then this value will be 0.
- Return type:
tuple[int, int]
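The return convention can be made concrete with a pure-Python sketch. It is an illustration only: the name compare_kmers is hypothetical, bytes stands in for the numpy byte array, and the choice to report 0 when one kmer ends before a difference is found is an assumption that may not match the package.

```python
def compare_kmers(sba_a: bytes, sba_b: bytes, start_a: int, start_b: int,
                  max_kmer_len=None):
    """Return (comparison, last_kmer_index_compared) for two kmers (sketch)."""
    end_marker = ord("$")

    def base_at(sba, idx):
        # None signals that the segment (or the array) has ended.
        if idx >= len(sba) or sba[idx] == end_marker:
            return None
        return sba[idx]

    last_compared = -1
    i = 0
    while max_kmer_len is None or i < max_kmer_len:
        base_a = base_at(sba_a, start_a + i)
        base_b = base_at(sba_b, start_b + i)
        if base_a is None or base_b is None:
            break
        last_compared = i
        if base_a != base_b:
            # Return as soon as the result is known.
            return (1 if base_a > base_b else -1), last_compared
        i += 1
    assert last_compared >= 0, "there were no valid bases to compare"
    # Assumption: no difference found over the bases compared counts as equal.
    return 0, last_compared
```

Comparing the kmers starting at indices 0 and 5 of b"ACGT$ACGA" finds the first difference at kmer index 3 (T versus A), while limiting the comparison to three bases reports the kmers as equal with the last compared index 2.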
- genome_kmers.kmers.crispr_ngg_pam_filter(sba: array, sba_strand: str, kmer_sba_start_idx: int) bool[source]
Filter that passes for all 23-mers ending in GG.
NOTE: no other checks on kmer validity are carried out.
- Raises:
ValueError – kmer extends beyond the size of the sequence byte array
- Returns:
whether kmer passes filter or not
- Return type:
bool
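A minimal sketch of such a check on a forward-strand byte string is shown below. The function name ngg_pam_filter is hypothetical, the sba_strand argument is omitted, and only the forward strand is handled, so this should be read as an illustration of the idea rather than the package's code.

```python
def ngg_pam_filter(sba: bytes, kmer_sba_start_idx: int) -> bool:
    """Pass if the 23-mer starting at kmer_sba_start_idx ends in GG (sketch)."""
    kmer_len = 23  # 20-base protospacer followed by the 3-base NGG PAM
    end = kmer_sba_start_idx + kmer_len
    if end > len(sba):
        raise ValueError("kmer extends beyond the sequence byte array")
    # Only the last two bases are inspected; no other validity checks are done.
    return sba[end - 2:end] == b"GG"
```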
- genome_kmers.kmers.gen_kmer_gc_content_filter_func(min_allowed_gc_frac: float, max_allowed_gc_frac: float, kmer_len: int) Callable[source]
Generate a filter function that returns True if fraction GC content is between min_allowed_gc_frac and max_allowed_gc_frac.
NOTE: this function does not do extra checks to ensure that the kmer itself is valid (i.e. that it does not overflow a segment boundary or extend beyond the sequence byte array, is made of valid bases, etc).
- Parameters:
min_allowed_gc_frac (float) – minimum allowed fraction GC content (inclusive). Must be in the range [0.0, 1.0] and be <= max_allowed_gc_frac.
max_allowed_gc_frac (float) – maximum allowed fraction GC content (inclusive). Must be in the range [0.0, 1.0] and be >= min_allowed_gc_frac.
kmer_len (int) – length of kmer
- Raises:
ValueError – min_allowed_gc_frac or max_allowed_gc_frac is invalid
- Returns:
filter
- Return type:
Callable
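The "generate a filter" pattern is a closure: the bounds and kmer length are captured once and the returned function is applied per kmer. The sketch below assumes a bytes sequence and a two-argument filter (the sba_strand argument is left out), and the name make_gc_filter is hypothetical.

```python
def make_gc_filter(min_allowed_gc_frac: float, max_allowed_gc_frac: float,
                   kmer_len: int):
    """Return a filter on GC fraction for kmers of length kmer_len (sketch)."""
    if not (0.0 <= min_allowed_gc_frac <= max_allowed_gc_frac <= 1.0):
        raise ValueError("GC bounds must satisfy 0.0 <= min <= max <= 1.0")

    def gc_filter(sba: bytes, kmer_sba_start_idx: int) -> bool:
        kmer = sba[kmer_sba_start_idx:kmer_sba_start_idx + kmer_len]
        gc_count = kmer.count(b"G") + kmer.count(b"C")
        # Both bounds are inclusive.
        return min_allowed_gc_frac <= gc_count / kmer_len <= max_allowed_gc_frac

    return gc_filter
```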
- genome_kmers.kmers.gen_kmer_homopolymer_filter_func(max_homopolymer_size: int, kmer_len: int) Callable[source]
Generate a filter function that passes if there is no homopolymer of length greater than max_homopolymer_size.
NOTE: this function does not do extra checks to ensure that the kmer itself is valid (i.e. that it does not overflow a segment boundary or extend beyond the sequence byte array, is made of valid bases, etc).
- Parameters:
max_homopolymer_size (int) – largest allowed homopolymer. Must be >= 1
kmer_len (int) – length of kmer
- Raises:
ValueError – max_homopolymer_size must be >= 2
- Returns:
filter
- Return type:
Callable
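A homopolymer check amounts to tracking the length of the current run of identical bases. The following sketch assumes a bytes sequence and omits both the sba_strand argument and the validation of max_homopolymer_size; the name make_homopolymer_filter is hypothetical.

```python
def make_homopolymer_filter(max_homopolymer_size: int, kmer_len: int):
    """Return a filter rejecting kmers with an over-long homopolymer (sketch)."""

    def homopolymer_filter(sba: bytes, kmer_sba_start_idx: int) -> bool:
        run_length = 0
        previous = None
        for base in sba[kmer_sba_start_idx:kmer_sba_start_idx + kmer_len]:
            # Extend the current run or start a new one.
            run_length = run_length + 1 if base == previous else 1
            if run_length > max_homopolymer_size:
                return False
            previous = base
        return True

    return homopolymer_filter
```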
- genome_kmers.kmers.gen_kmer_length_filter_func(min_kmer_len: int) Callable[source]
Generates a filter that passes if the kmer is at least min_kmer_len long, and fails otherwise.
- Parameters:
min_kmer_len (int) – minimum required kmer length
- Returns:
filter
- Return type:
Callable
- genome_kmers.kmers.gen_no_ambiguous_bases_filter(kmer_len: int) Callable[source]
Generate a filter that checks that only “A”, “T”, “G”, or “C” are present in the kmer.
- Parameters:
kmer_len (int) – kmer length
- Raises:
ValueError – end of segment or sequence byte array is reached before end of kmer
- Returns:
no_ambiguous_bases_filter
- Return type:
Callable
- genome_kmers.kmers.get_kmer_group_size_hist(sba: array, sba_strand: str, kmer_len: int | None, kmer_start_indices: array, kmer_comparison_func: Callable, kmer_filter_func: Callable, min_group_size: int = 1, max_group_size: int | None = None, max_counts_bin: int = 1000000) tuple[array, int][source]
Build a histogram of group sizes. counts_by_group_size[i] gives the number of groups of size i. Any group sizes > max_counts_bin will be placed in max_counts_bin. The total number of kmers is also calculated.
- Parameters:
sba (np.array) – sequence byte array
sba_strand (str) – “forward” or “reverse_complement”
kmer_len (Union[int, None]) – length of kmer. If None, take the longest possible.
kmer_start_indices (np.array) – kmer sequence byte array start indices
kmer_comparison_func (Callable) – function that returns the result of a two kmer comparison
kmer_filter_func (Callable) – function that returns true if a kmer passes all filters
min_group_size (int, optional) – Smallest allowed group size. Defaults to 1.
max_group_size (Union[int, None], optional) – Largest allowed group size. If None, then there is no maximum group size. Defaults to None.
max_counts_bin (int, optional) – largest group_size bin in the return counts_by_group_size array. Group sizes that exceed this value will be placed in this bin. Defaults to 1000000.
- Raises:
ValueError – max_counts_bin is invalid
- Returns:
counts_by_group_size, total_kmer_count
- Return type:
tuple[np.array, int]
- genome_kmers.kmers.get_kmer_info_group_size_only(kmer_num: int, kmer_sba_start_indices: array, sba: array, kmer_len: int | None, group_size_yielded: int, group_size_total: int) tuple[int, int, int][source]
Return only group_size_total without any processing.
- Parameters:
kmer_num (int) – kmer number
kmer_sba_start_indices (np.array) – kmer sequence byte array start indices
sba (np.array) – sequence byte array
kmer_len (Union[int, None]) – length of kmer. If None, take the longest possible.
group_size_yielded (int) – number of kmers in the group that will be yielded
group_size_total (int) – number of kmers in the group in total
- Returns:
group_size_total
- Return type:
int
- genome_kmers.kmers.get_kmer_info_minimal(kmer_num: int, kmer_sba_start_indices: array, sba: array, kmer_len: int | None, group_size_yielded: int, group_size_total: int) tuple[int, int, int][source]
Return only basic kmer information without any processing. Used as an input to kmer_info_by_group_generator when only basic information is required.
- Parameters:
kmer_num (int) – kmer number
kmer_sba_start_indices (np.array) – kmer sequence byte array start indices
sba (np.array) – sequence byte array
kmer_len (Union[int, None]) – length of kmer. If None, take the longest possible.
group_size_yielded (int) – number of kmers in the group that will be yielded
group_size_total (int) – number of kmers in the group in total
- Returns:
kmer_num, group_size_yielded, group_size_total
- Return type:
tuple[int, int, int]
- genome_kmers.kmers.kmer_filter_keep_all(sba: array, sba_strand: str, kmer_sba_start_idx: int)[source]
- genome_kmers.kmers.kmer_has_required_len(sba: array, sba_start_idx: int, min_kmer_len: int) bool[source]
Checks whether the kmer is of at least min_kmer_len before reaching the end of the segment.
- Parameters:
sba (np.array) – sequence byte array
sba_start_idx (int) – sequence byte array start index for the kmer
min_kmer_len (int) – minimum kmer length
- Return type:
bool
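Because segments are separated by "$" in the sequence byte array, the length check reduces to looking for a terminator inside the candidate window. A sketch under that assumption, using bytes in place of the numpy array:

```python
def kmer_has_required_len(sba: bytes, sba_start_idx: int, min_kmer_len: int) -> bool:
    """True if min_kmer_len bases are available before the segment ends (sketch)."""
    end = sba_start_idx + min_kmer_len
    if end > len(sba):
        # The window runs past the end of the array.
        return False
    # A "$" inside the window means the segment ended too early.
    return b"$" not in sba[sba_start_idx:end]
```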
- genome_kmers.kmers.kmer_info_by_group_generator(sba: array, sba_strand: str, kmer_len: int | None, kmer_start_indices: array, kmer_comparison_func: Callable, kmer_filter_func: Callable, kmer_info_func: Callable, min_group_size: int = 1, max_group_size: int | None = None, yield_first_n: int | None = None) Generator[tuple, None, None][source]
Generator that yields the valid kmer information and total group size for all groups meeting requirements. A valid kmer is one that passes the provided kmer_filter_func. A group is defined as the set of identical kmers as defined by the kmer_comparison_func. The first “yield_first_n” kmers will be yielded if the group meets all provided requirements. It must have a total group size between min_group_size and max_group_size (inclusive). The kmer information that is yielded is customizable and defined by kmer_info_func.
- Parameters:
sba (np.array) – sequence byte array
sba_strand (str) – “forward” or “reverse_complement”
kmer_len (Union[int, None]) – length of kmer. If None, take the longest possible.
kmer_start_indices (np.array) – kmer sequence byte array start indices
kmer_comparison_func (Callable) – function that returns the result of a two kmer comparison
kmer_filter_func (Callable) – function that returns true if a kmer passes all filters
kmer_info_func (Callable) – function that returns a tuple with all the kmer information to be yielded.
min_group_size (int, optional) – Smallest allowed group size. Defaults to 1.
max_group_size (Union[int, None], optional) – Largest allowed group size. If None, then there is no maximum group size. Defaults to None.
yield_first_n (Union[int, None], optional) – yield up to this many kmer_nums. Defaults to None.
- Raises:
ValueError – invalid min_group_size, max_group_size, or yield_first_n
- Yields:
Generator[tuple[list[int], int], None, None] – valid_kmer_nums_in_group, group_size
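The grouping logic described above can be sketched on a sorted list of strings. This is a simplified stand-in, not the package's generator: the name groups_meeting_requirements is hypothetical, kmers are compared by plain equality, and it assumes that only kmers passing the filter count towards the group size.

```python
from itertools import groupby


def groups_meeting_requirements(sorted_kmers, keep=lambda kmer: True,
                                min_group_size=1, max_group_size=None,
                                yield_first_n=None):
    """Yield (valid_kmer_nums_in_group, group_size) per qualifying group (sketch)."""
    numbered = list(enumerate(sorted_kmers))
    for _, group in groupby(numbered, key=lambda pair: pair[1]):
        # Assumption: kmers rejected by the filter do not count towards the group.
        valid = [num for num, kmer in group if keep(kmer)]
        size = len(valid)
        if size < min_group_size:
            continue
        if max_group_size is not None and size > max_group_size:
            continue
        # Slicing with None keeps every kmer_num in the group.
        yield valid[:yield_first_n], size
```

Setting max_group_size=1 keeps only unique kmers, while yield_first_n=1 yields one representative per group but still reports the full group size.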
sequence_collection
- class genome_kmers.sequence_collection.SequenceCollection(fasta_file_path: Path | None = None, sequence_list: list[tuple[str, str]] | None = None, strands_to_load: str = 'forward')[source]
Bases: object
Holds all the information contained within a fasta file in a format conducive to kmer sorting.
Terminology
sba: sequence byte array
revcomp: reverse complement
record: each header and its corresponding sequence is called a record. record_num is based on the order in which records are read in. record_num does not change when reverse complemented.
segment: the same as a record, except that segment_num always starts leftmost, i.e. the sba end index for segment N is always less than the sba end index for segment M > N.
forward_sba_idx: index in forward sequence byte array
revcomp_sba_idx: index in reverse complement sequence byte array
forward_seq_idx: 0-based index for a sequence on the forward strand
revcomp_seq_idx: 0-based index for a sequence on the reverse complement strand

    Forward strand
    --------------
    record_num                  0           1                  2
    segment_num                 0           1                  2
    forward_record_name         A           B                  C
    forward_sba             [-------]$[------------]$[--------------------]
    forward_sba_seg_starts  0         10             25
    forward_sba_seg_ends            8             23                     46

    Reverse complement strand
    -------------------------
    revcomp_sba_seg_ends                        21             36        46
    revcomp_sba_seg_starts  0                      23             38
    revcomp_sba             [====================]$[============]$[=======]
    revcomp_record_name               C                  B            A
    segment_num                       0                  1            2
    record_num                        2                  1            0

Notes
- a "$" is placed between each sequence so that you can determine if you've reached the end of a sequence without referencing the seg_starts / seg_ends array.
- the collection must contain at least one sequence
- all sequences in the collection must have a length > 0
- duplicate record_names are not allowed
Members
forward_sba (np.array): forward sba (dtype=uint8)
_forward_sba_seg_starts (np.array): value at index i gives the sba start index for the ith segment on the forward strand. (dtype=uint32)
revcomp_sba (np.array): reverse complement sba (dtype=uint8)
_revcomp_sba_seg_starts (np.array): value at index i gives the sba start index for the ith segment on the reverse complement strand. (dtype=uint32)
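The layout in the diagram can be reproduced with a short sketch that concatenates sequences with "$" separators and records where each segment starts and ends. The helper name build_forward_sba is hypothetical, and plain bytes and lists are used in place of the uint8 and uint32 numpy arrays.

```python
def build_forward_sba(records):
    """Concatenate (name, sequence) records with "$" separators (sketch)."""
    sba = bytearray()
    seg_starts = []
    for i, (_name, seq) in enumerate(records):
        if i > 0:
            sba += b"$"  # separator between consecutive segments
        seg_starts.append(len(sba))
        sba += seq.encode("ascii")
    # Each segment ends just before the next separator; the last one ends
    # at the final index of the array.
    seg_ends = [start - 2 for start in seg_starts[1:]] + [len(sba) - 1]
    return bytes(sba), seg_starts, seg_ends


# Sequence lengths 9, 14 and 22 match the segment widths drawn in the diagram.
records = [("A", "A" * 9), ("B", "C" * 14), ("C", "G" * 22)]
sba, seg_starts, seg_ends = build_forward_sba(records)
```

With these lengths the starts come out as 0, 10, 25 and the ends as 8, 23, 46, as in the forward strand diagram.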
- generate_get_record_info_from_sba_index_func(one_based: bool = False) Callable[source]
Generate a function that returns record info when provided a sequence byte array index
- Parameters:
one_based (bool, optional) – whether sequence indices should be one-based. Defaults to False.
- Raises:
ValueError – sba_strand not recognized
- Returns:
get_record_info_from_sba_index
- Return type:
Callable
- get_record_loc_from_sba_index(sba_idx: int, sba_strand: str = None, one_based: bool = False) tuple[str, str, int][source]
Get the sequence location (strand, record_name, seq_idx) from the sequence byte array index
- Parameters:
sba_idx (int) – sequence byte array index
sba_strand (str, optional) – sequence byte array strand. If strands_loaded is “both”, then it must be specified as “forward” or “reverse_complement”. Otherwise it is inferred. Defaults to None.
one_based (bool, optional) – whether the returned seq index should be one-based. Defaults to False.
- Raises:
ValueError – _description_
- Returns:
strand, record_name, seq_idx
- Return type:
tuple[str, str, int]
- get_record_name_from_sba_index(sba_idx: int, sba_strand: str = None) str[source]
Get the sequence record name from the sequence byte array index.
- Parameters:
sba_idx (int) – sequence byte array index
sba_strand (str, optional) – for which strand is the sba_idx defined (“forward” or “reverse_complement”). Must be defined when SequenceCollection has both strands loaded. If specified when only a single strand has been loaded, it will verify that it matches what is expected. If set to None, it will automatically detect the strand that was loaded. Defaults to None.
- Returns:
record_name
- Return type:
str
- get_sba_start_end_indices_for_segment(segment_num: int, sba_strand: str = None) tuple[int, int][source]
Given a segment number (and optionally sba_strand), return the first sba index and last sba index of the segment.
- Parameters:
segment_num (int) – segment number
sba_strand (str, optional) – sequence byte array strand (“forward”, “reverse_complement”, or “both”). Defaults to None.
- Raises:
ValueError – segment_num out of bounds
- Returns:
sba_start_index, sba_end_index
- Return type:
tuple[int, int]
- get_segment_num_from_sba_index(sba_idx: int, sba_strand: str = None) int[source]
Get the segment number from the sequence byte array index defined on sba_strand (attempt to automatically detect the strand if not specified)
- Parameters:
sba_idx (int) – sequence byte array index
sba_strand (str, optional) – for which strand is the sba_idx defined (“forward” or “reverse_complement”). Must be defined when SequenceCollection has both strands loaded. If specified when only a single strand has been loaded, it will verify that it matches what is expected. If set to None, it will automatically detect the strand that was loaded. Defaults to None.
- Returns:
segment_num
- Return type:
int
- iter_records(sba_strand: str = None)[source]
Yield (record_name, record_sba_start_idx, record_sba_end_idx) for all sequence records in order of increasing record_num. i.e. if records are ordered as “chr1”, “chr2”, “chr3”, they will be yielded in that same order regardless of whether the strand is “forward” or “reverse_complement”.
- Parameters:
sba_strand (str, optional) – for which strand to yield records. If self._strands_loaded == “both”, it must be specified. Otherwise it is determined from the loaded strand (“forward” or “reverse_complement”). Defaults to None.
- Yields:
(record_name, record_sba_start_idx, record_sba_end_idx)
- load(load_file_path: Path, format: str = 'hdf5')[source]
Load SequenceCollection object from saved file.
- Parameters:
load_file_path (Path) – path to file to load.
format (str, optional) – “hdf5” or “shelve”. Defaults to “hdf5”.
- Raises:
ValueError – format not recognized
- reverse_complement() array[source]
Reverse complement the sequence byte array. Only valid if a single strand is loaded.
- save(save_file_path: Path, mode: str = 'a', format: str = 'hdf5') None[source]
Save SequenceCollection object to file.
- Parameters:
save_file_path (Path) – path for saved file
format (str, optional) – “hdf5” or “shelve”. Defaults to “hdf5”.
mode (str, optional) – mode with which to open file and save information. “w” for write or “a” for append. Defaults to “a”.
- Raises:
ValueError – format not recognized
- genome_kmers.sequence_collection.bisect_right(a, x)[source]
NOTE: this is a lightly modified copy of bisect_right from the Python standard library. The standard library version cannot be used with the @jit decorator, which is required because the functions that call this function use the @jit decorator themselves.
Return the index where to insert item x in list a, assuming a is sorted.
The return value i is such that all e in a[:i] have e <= x, and all e in a[i:] have e > x. So if x already appears in the list, a.insert(i, x) will insert just after the rightmost x already there.
Optional args lo (default 0) and hi (default len(a)) bound the slice of a to be searched.
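The contract described above can be written as a plain binary search. The sketch below omits the @jit decorator and takes only a and x, matching the two-argument signature shown; it is an illustration and not necessarily identical to the package's source.

```python
import bisect


def bisect_right(a, x):
    """Return the insertion point for x in sorted a, right of any equal items (sketch)."""
    lo, hi = 0, len(a)
    while lo < hi:
        mid = (lo + hi) // 2
        if x < a[mid]:
            hi = mid      # x belongs strictly before a[mid]
        else:
            lo = mid + 1  # x belongs after a[mid], including when equal
    return lo


# The result agrees with the standard library on a sorted list.
seg_starts = [0, 10, 25]
same = all(bisect_right(seg_starts, x) == bisect.bisect_right(seg_starts, x)
           for x in range(-1, 30))
```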
- genome_kmers.sequence_collection.get_forward_seq_idx(sba_idx: int, sba_strand: str, seg_sba_start_idx: int, seg_sba_end_idx: int, one_based: bool = False) int[source]
Get the forward sequence index given a sequence byte array index and the segment start and end indices. Optionally returns a one-based sequence index.
- Parameters:
sba_idx (int) – sequence byte array index
sba_strand (str) – sequence byte array strand (“forward” or “reverse_complement”)
seg_sba_start_idx (int) – sequence byte array index for the start of the segment
seg_sba_end_idx (int) – sequence byte array index for the end of the segment
one_based (bool, optional) – whether the returned seq index should be one-based. Defaults to False.
- Raises:
ValueError – if sba_idx, seg_sba_start_idx, or seg_sba_end_idx is not valid
ValueError – if sba_strand is not recognized
- Returns:
sequence forward strand index
- Return type:
int
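The index arithmetic can be sketched as follows. On the forward strand the sequence index is the offset from the segment start; on the reverse complement strand the segment is read backwards, so the forward index is the offset from the segment end. The function name forward_seq_idx and the exact validation are assumptions.

```python
def forward_seq_idx(sba_idx: int, sba_strand: str, seg_sba_start_idx: int,
                    seg_sba_end_idx: int, one_based: bool = False) -> int:
    """Map an sba index inside a segment to a forward-strand sequence index (sketch)."""
    if not seg_sba_start_idx <= sba_idx <= seg_sba_end_idx:
        raise ValueError("sba_idx is outside the segment")
    if sba_strand == "forward":
        idx = sba_idx - seg_sba_start_idx
    elif sba_strand == "reverse_complement":
        # The last base of the revcomp segment is the first forward base.
        idx = seg_sba_end_idx - sba_idx
    else:
        raise ValueError(f"sba_strand not recognized: {sba_strand}")
    return idx + 1 if one_based else idx
```

Using the SequenceCollection diagram, record A occupies revcomp sba indices 38 to 46, so revcomp index 46 maps to forward sequence index 0 and revcomp index 38 maps to forward sequence index 8.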
- genome_kmers.sequence_collection.get_sba_start_end_indices_for_segment(segment_num: int, sba_strand: str, sba_seg_starts: array, len_sba: int) tuple[int, int][source]
Given a segment number (and optionally sba_strand), return the first sba index and last sba index of the segment.
- Parameters:
segment_num (int) – segment number
sba_strand (str, optional) – sequence byte array strand (“forward”, “reverse_complement”, or “both”). Defaults to None.
- Raises:
ValueError – segment_num out of bounds
- Returns:
sba_start_index, sba_end_index
- Return type:
tuple[int, int]
- genome_kmers.sequence_collection.get_segment_num_from_sba_index(sba_idx: int, sba_strand: str, sba_seg_starts: array) int[source]
Get the segment number from the sequence byte array index.
- Parameters:
sba_idx (int) – sequence byte array index
sba_strand (str) – “forward” or “reverse_complement”
sba_seg_starts (np.array) – sba indices for each segment start (dtype=np.uint32)
- Returns:
segment_num
- Return type:
int
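Since sba_seg_starts is sorted, the lookup can be done with a right bisection: the segment number is one less than the insertion point of sba_idx. The sketch below uses the standard library bisect on a plain list and drops the sba_strand argument; whether the package handles a separator index the same way is an assumption.

```python
import bisect


def segment_num_from_sba_index(sba_idx: int, sba_seg_starts) -> int:
    """Return the number of the segment containing sba_idx (sketch)."""
    # bisect_right counts the segment starts that are <= sba_idx.
    return bisect.bisect_right(sba_seg_starts, sba_idx) - 1
```

With segment starts 0, 10 and 25, index 10 falls in segment 1 and index 46 in segment 2. In this sketch a separator index such as 9 is attributed to the preceding segment.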
- genome_kmers.sequence_collection.reverse_complement_sba(sba: array, complement_mapping_arr: array, inplace=False) array[source]
Reverse complement sequence byte array (sba) using the uint8 to uint8 mapping array (complement_mapping_arr). This function uses numba.jit for performance.
- Parameters:
sba (np.array) – sequence byte array (dtype=uint8)
complement_mapping_arr (np.array) – maps from sequence byte array value (uint8) to complement sequence byte array value (uint8)
inplace (bool, optional) – whether to perform in place or return a newly created array. Defaults to False.
- Returns:
reverse complemented sequence byte array
- Return type:
np.array
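The mapping-array approach can be illustrated without numpy or numba using bytes.translate, which applies a 256-entry lookup table to every byte. The helper names below are hypothetical, only the not-in-place variant is shown, and the table covers A, C, G, T in both cases while leaving every other byte (including "$") unchanged.

```python
def make_complement_mapping() -> bytes:
    """Build a 256-entry table mapping each byte to its complement (sketch)."""
    mapping = list(range(256))  # identity for "$" and any other byte
    for base, complement in zip(b"ACGTacgt", b"TGCAtgca"):
        mapping[base] = complement
    return bytes(mapping)


def reverse_complement_bytes(sba: bytes, complement_mapping: bytes) -> bytes:
    """Complement every base, then reverse the whole array (sketch)."""
    return sba.translate(complement_mapping)[::-1]
```

Reversing the whole array also reverses the segment order: b"AACG$TT" becomes b"AA$CGTT", which is why the reverse complement strand in the SequenceCollection diagram lists the records in the opposite order.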