Question about the TinyRAG get_chunk function #38

@Yoda-wu

Description

            line = line.replace(' ', '')
            line_len = len(enc.encode(line))
            if line_len > max_token_len:
                # If a single line already exceeds the limit, split it into multiple chunks
                num_chunks = (line_len + token_len - 1) // token_len
                for i in range(num_chunks):
                    start = i * token_len
                    end = start + token_len
                    # Avoid splitting in the middle of a word
                    while not line[start:end].rstrip().isspace():
                        start += 1
                        end += 1
                        if start >= line_len:
                            break
                    curr_chunk = curr_chunk[-cover_content:] + line[start:end]
                    chunk_text.append(curr_chunk)
                # Handle the last chunk
                start = (num_chunks - 1) * token_len
                curr_chunk = curr_chunk[-cover_content:] + line[start:end]
                chunk_text.append(curr_chunk)

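Is there a problem with the logic above for handling a single line that exceeds the limit? I would appreciate an explanation 🙏
Since replace has already been applied to line and removed the whitespace inside it, can line[start:end].rstrip().isspace() ever be true when the while loop runs? If it cannot, the loop only exits through the break, and the final line[start:end] ends up as an empty string.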

Example code:

line = "Thisisaverylongsentence"
start = 0
end = 8
line_len = len(line)

while not line[start:end].rstrip().isspace():
    print(f"start: {start}, end: {end}, line[start:end]: '{line[start:end]}'")
    start += 1
    end += 1
    if start >= line_len:
        break

print(f"After the loop, line[start:end] is: '{line[start:end]}'")
"""Output
start: 0, end: 8, line[start:end]: 'Thisisav'
start: 1, end: 9, line[start:end]: 'hisisave'
start: 2, end: 10, line[start:end]: 'isisaver'
start: 3, end: 11, line[start:end]: 'sisavery'
start: 4, end: 12, line[start:end]: 'isaveryl'
start: 5, end: 13, line[start:end]: 'saverylo'
start: 6, end: 14, line[start:end]: 'averylon'
start: 7, end: 15, line[start:end]: 'verylong'
start: 8, end: 16, line[start:end]: 'erylongs'
start: 9, end: 17, line[start:end]: 'rylongse'
start: 10, end: 18, line[start:end]: 'ylongsen'
start: 11, end: 19, line[start:end]: 'longsent'
start: 12, end: 20, line[start:end]: 'ongsente'
start: 13, end: 21, line[start:end]: 'ngsenten'
start: 14, end: 22, line[start:end]: 'gsentenc'
start: 15, end: 23, line[start:end]: 'sentence'
start: 16, end: 24, line[start:end]: 'entence'
start: 17, end: 25, line[start:end]: 'ntence'
start: 18, end: 26, line[start:end]: 'tence'
start: 19, end: 27, line[start:end]: 'ence'
start: 20, end: 28, line[start:end]: 'nce'
start: 21, end: 29, line[start:end]: 'ce'
start: 22, end: 30, line[start:end]: 'e'
After the loop, line[start:end] is: ''
"""
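
One detail that may explain the behaviour independently of the replace call: str.rstrip() removes trailing whitespace, so its result is either empty or ends in a non-whitespace character, and str.isspace() returns False in both cases. If that reading is right, the loop condition is true for every window, with or without spaces in the line. The small check below is only an illustration; loop_condition is a name made up here and is not part of TinyRAG.

```python
def loop_condition(window: str) -> bool:
    """Mirror of the `while` condition quoted from get_chunk (hypothetical wrapper)."""
    return not window.rstrip().isspace()


# Windows without spaces, with a trailing space, whitespace only, empty, and padded
samples = ["Thisisav", "This is ", "   ", "", " a "]
results = {s: loop_condition(s) for s in samples}

for window, keeps_looping in results.items():
    print(repr(window), keeps_looping)
# Every sample prints True: the condition never becomes false,
# so the loop can only stop at `if start >= line_len: break`.
```

Under this assumption, keeping the spaces in line would not change the outcome; the window would still slide to the end of the line.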

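For discussion, here is a minimal sketch of one way the long-line branch could be written so that it always makes progress and never emits an empty chunk. It is not the project's implementation: split_long_line, chunk_len and overlap are hypothetical names, lengths are counted in characters rather than tokens, and only a plain space is treated as a word boundary.

```python
def split_long_line(line: str, chunk_len: int, overlap: int = 0) -> list[str]:
    """Split `line` into pieces of at most `chunk_len` characters.

    Hypothetical helper, not TinyRAG's API. Each piece after the first
    starts with up to `overlap` characters copied from the previous piece.
    """
    if chunk_len <= 0 or not 0 <= overlap < chunk_len:
        raise ValueError("need chunk_len > 0 and 0 <= overlap < chunk_len")

    chunks = []
    prev_tail = ""                  # overlap text carried into the next piece
    body_len = chunk_len - overlap  # new characters consumed per piece
    start = 0
    while start < len(line):
        end = min(start + body_len, len(line))
        # If the cut would land inside a word, back up to the last space
        # in the window; with no space available, cut at `end` anyway.
        if end < len(line) and line[end] != " ":
            cut = line.rfind(" ", start + 1, end)
            if cut != -1:
                end = cut + 1
        body = line[start:end]
        chunks.append(prev_tail + body)
        prev_tail = body[-overlap:] if overlap else ""
        start = end                 # `end > start`, so the loop terminates
    return chunks


print(split_long_line("Thisisaverylongsentence", 8))
# ['Thisisav', 'erylongs', 'entence']
print(split_long_line("This is a very long sentence", 10))
# ['This is a ', 'very long ', 'sentence']
```

A token-based variant would need to slice the encoded token list and decode each slice instead of slicing the string, since line_len in the quoted code is a token count while start and end index characters.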