Simplify tuple delimiter regex patterns for LLM output fixing
• Consolidate 6 regex patterns into 3
• More efficient pattern matching
• Clearer comments and examples
• Same functionality, less code
• Better maintainability
parent 78eadc1d6c
commit b9f80263b8
1 changed file with 12 additions and 39 deletions
@@ -877,63 +877,36 @@ async def _process_extraction_result(
 # Fix various forms of tuple_delimiter corruption from the LLM output.
 # It handles missing or replaced characters around the core delimiter.
-# 1. `<` or `>` may be missing.
+# 1. There might be extra characters inserted between a bracket and a pipe.
 # 2. `|` may be missing or replaced by another character.
-# 3. There might be extra characters inserted.
-# 4. Missing opening `<` or closing `>`
+# 3. Missing opening `<` or closing `>`
 # Example transformations:
-# <SEP> -> <|SEP|>
-# <SEP|> -> <|SEP|> (where left | is missing)
-# <|SEP> -> <|SEP|> (where right | is missing)
-# <XSEP|> -> <|SEP|> (where left | is replaced by another character)
-# <|SEPX> -> <|SEP|> (where right | is replaced by another character)
-# <|SEP|X> -> <|SEP|> (where X is not '>')
-# <XX|SEP|YY> -> <|SEP|> (handles extra characters)
-# |SEP|> -> <|SEP|> (where left | is missing)
-# <|SEP| -> <|SEP|> (where right | is missing)
+# <X|SEP|> -> <|SEP|>, <|SEP|Y> -> <|SEP|>, <X|SEP|Y> -> <|SEP|> (one extra character outside the pipes)
+# <SEP>, <SEP|>, <|SEP> -> <|SEP|> (missing one or both pipes)
+# <XSEP|> -> <|SEP|>, <|SEPX> -> <|SEP|> (where one | is replaced by another character)
+# |SEP|> -> <|SEP|>, <|SEP| -> <|SEP|> (where one | is missing)

 escaped_delimiter_core = re.escape(
     tuple_delimiter[2:-2]
 )  # Extract "SEP" from "<|SEP|>"

-# Fix: <SEP> -> <|SEP|> (missing pipes)
+# Fix: <X|SEP|> -> <|SEP|>, <|SEP|Y> -> <|SEP|>, <X|SEP|Y> -> <|SEP|> (one extra character outside the pipes)
 record = re.sub(
-    rf"<{escaped_delimiter_core}>",
+    rf"<.?\|{escaped_delimiter_core}\|.?>",
     tuple_delimiter,
     record,
 )

-# Fix: <SEP|> -> <|SEP|> (missing left pipe only)
+# Fix: <SEP>, <SEP|>, <|SEP> -> <|SEP|> (missing one or both pipes)
 record = re.sub(
-    rf"<{escaped_delimiter_core}\|>",
+    rf"<\|?{escaped_delimiter_core}\|?>",
     tuple_delimiter,
     record,
 )

-# Fix: <|SEP> -> <|SEP|> (missing right pipe only)
+# Fix: <XSEP|> -> <|SEP|>, <|SEPX> -> <|SEP|> (one pipe is replaced by another character)
 record = re.sub(
-    rf"<\|{escaped_delimiter_core}>",
-    tuple_delimiter,
-    record,
-)
-
-# Fix: <XSEP|> -> <|SEP|> (character X replacing first pipe)
-record = re.sub(
-    rf"<[^|]+{escaped_delimiter_core}\|>",
-    tuple_delimiter,
-    record,
-)
-
-# Fix: <|SEPX> -> <|SEP|> (character X replacing second pipe)
-record = re.sub(
-    rf"<\|{escaped_delimiter_core}[^|]+>",
-    tuple_delimiter,
-    record,
-)
-
-# Fix: <XX|SEP|YY> -> <|SEP|> (extra characters around, but preserve correct delimiters)
-record = re.sub(
-    rf"<[^<>]+\|{escaped_delimiter_core}\|[^<>]+>",
+    rf"<[^|]{escaped_delimiter_core}\|>|<\|{escaped_delimiter_core}[^|]>",
     tuple_delimiter,
     record,
 )
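To see how the three consolidated substitutions behave outside the surrounding function, here is a minimal standalone sketch. The helper name `fix_tuple_delimiter` and the default delimiter `<|SEP|>` are assumptions made for illustration; only the three patterns and the `tuple_delimiter[2:-2]` slice are taken from the diff.

```python
import re


def fix_tuple_delimiter(record: str, tuple_delimiter: str = "<|SEP|>") -> str:
    """Apply the three added substitutions in the same order as the diff.

    Hypothetical helper: the name and the default delimiter are assumptions.
    """
    # Strip the two-character "<|" and "|>" wrappers, e.g. "SEP" from "<|SEP|>".
    core = re.escape(tuple_delimiter[2:-2])
    patterns = [
        # At most one stray character between a bracket and its pipe.
        rf"<.?\|{core}\|.?>",
        # One or both pipes missing.
        rf"<\|?{core}\|?>",
        # Exactly one pipe replaced by a single non-pipe character.
        rf"<[^|]{core}\|>|<\|{core}[^|]>",
    ]
    for pattern in patterns:
        record = re.sub(pattern, tuple_delimiter, record)
    return record


if __name__ == "__main__":
    samples = [
        "a<X|SEP|Y>b",   # stray characters outside the pipes
        "a<SEP>b",       # both pipes missing
        "a<XSEP|>b",     # left pipe replaced
        "a<|SEPX>b",     # right pipe replaced
        "a<|SEP|>b",     # already correct
        "a<XX|SEP|YY>b", # two stray characters on each side
    ]
    for sample in samples:
        print(f"{sample!r} -> {fix_tuple_delimiter(sample)!r}")
```

Each `.?` and `[^|]` matches at most one character, so an input such as `<XX|SEP|YY>`, which the removed `[^<>]+` pattern accepted, passes through these three substitutions unchanged. A correctly formed delimiter also matches the first pattern and is replaced by itself, so running the helper twice gives the same result as running it once.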