some background: currently I am querying 4Mio rows (with 50 columns) from a MS SQL server with dbatools into a PSObject (in Batch 10.000 rows each query), processing the data with PowerShell (a lot of RegEx stuff) and writing back into a MariaDb with SimplySql. In average i get approx. 150 rows/sec. Had to use a lot of tricks (Net's Stringbuilder etc.) for this performance, its not that bad imho
As new requirements I want to detect the language of some text cells and I have to remove personal data (name & address). I found some good python libs (spacy and pycld2) for that purpose. I made tests with pycld2 - pretty good detection.
Simplified code for clarification (hint:I am a python noob):
#get data from MS SQL
$data = Invoke-DbaQuery -SqlInstance $Connection -Query $Query -As PSObject -QueryTimeout 1800
for ($i=0;$i -lt $data.length;$i++){
#do a lot of other stuff here
#...
#finally make lang detection
if ($LangDetect.IsPresent){
$strLang = $tCaseDescription -replace "([^\p{L}\p{N}_\.\s]|`t|`n|`r)+",""
$arg = "import pycld2 as cld2; isReliable, textBytesFound, details = cld2.detect('" + $strLang + "', isPlainText = True, bestEffort = True);print(details[0][1])"
$tCaseLang = & $Env:Programfiles\Python39\python.exe -c $arg
} else {
$tCaseLang = ''
}
}
#write to MariaDB
Invoke-SqlUpdate -ConnectionName $ConnectionName -Query $Query
This python call each time works, but it destroys the performance (12rows/sec) due the loop-call and importing pycld2 lib each time. So, this is a lame solution :) In addition, as mentioned above - I want to use spacy - where some more columns has to parsed for getting rid of the personal data.
I am not sure, if I have the mood to convert the whole PS Parser to python :|
I believe, a better solution might be to pass the whole PSObject from PowerShell to python (before the PS loop starts) and return it as well as PSObject - after it has been processed in python - but I don't know, how I can realize this with python / python function.
What would be your approach/suggestions, any other ideas? Thanks :)
The following simplified example shows you how you can pass multiple [pscustomobject]
([psobject]
) instances from PowerShell to a Python script (passed as a string via -c
in this case):
by using JSON as the serialization format, via ConvertTo-Json
...
... and passing that JSON via the pipeline, which Python can read via stdin (standard input).
Important:
Character encoding:
PowerShell uses the encoding specified in the $OutputEncoding
preference variable when sending data to external programs (such as Python), which commendably defaults to BOM-less UTF-8 in PowerShell [Core] v6+, but regrettably to ASCII(!) in Windows PowerShell.
Just like PowerShell limits you to sending text to an external program, it also invariably interprets what it receives as text, namely based on the encoding stored in [Console]::OutputEncoding
; regrettably, both PowerShell editions as of this writing default to the system's OEM code page.
To both send and receive (BOM-less) UTF-8 in both PowerShell editions, (temporarily) set $OutputEncoding
and [Console]::OutputEncoding
as follows:
$OutputEncoding = [Console]::OutputEncoding = [System.Text.Utf8Encoding]::new($false)
If you want your Python script to also output objects, again consider using JSON, which on the PowerShell you can parse into objects with ConvertFrom-Json
.
# Sample input objects.
$data = [pscustomobject] @{ one = 1; two = 2 }, [pscustomobject] @{ one = 10; two = 20 }
# Convert to JSON and pipe to Python.
ConvertTo-Json $data | python -c @'
import sys, json
# Parse the JSON passed via stdin into a list of dictionaries.
dicts = json.load(sys.stdin)
# Sample processing: print the 'one' entry of each dict.
for dict in dicts:
print(dict['one'])
'@
If the data to pass is a collection of single-line strings, you don't need JSON:
$data = 'foo', 'bar', 'baz'
$data | python -c @'
import sys
# Sample processing: print each stdin input line enclosed in [...]
for line in sys.stdin:
print('[' + line.rstrip('\r\n') + ']')
'@