Python 3 regex not finding confirmed matches


So I'm trying to parse a bunch of citations from a text file using the re module in Python 3.4 (on, if it matters, a Mac running Mavericks). Here's some minimal code. Note that there are two commented lines: they represent two alternative searches. (Obviously, the little one, r'rawls', is the one that works.)

import re

def makereflist(reffile):
    print(reffile)
    # Two alternative searches; uncomment one of them:
    # namepattern = r'(^[A-Z1][a-zA-Z1]*-?[a-zA-Z1]*),.*( \(?\d\d\d\d[a-z]?[.)])'
    # namepattern = r'rawls'
    refstupleslist = re.findall(namepattern, reffile, re.MULTILINE)
    print(refstupleslist)

The string in question is ugly, so I've stuck it in a gist: https://gist.github.com/paultopia/6c48c398a42d4834f2ae

As noted, the search string r'rawls' produces the expected output ['rawls', 'rawls']. However, the other search string just produces an empty list.

I've confirmed that this regex (partially) works using the regex101 tester. Confirmation here: https://regex101.com/r/kp4no0/1. The matches I expect to match do match. Since it works in the tester, it should work in the code, right?

(N.B. I copied the text from the terminal output of the first print command, and manually replaced the \n characters in the string with carriage returns for regex101.)

One possible issue is that Python has appended the bytecode flag (is the little b called a "flag"?) to the string. This is an artifact of my attempt to convert the text from UTF-8 to ASCII, and I haven't figured out how to make it go away.

Yet re is able to parse strings in that form. I know this because I'm converting two text files from UTF-8 to ASCII, and the following code works perfectly fine on the other string, converted from the other text file, which also has a little b in front of it:

def makecitelist(citefile):
    print(citefile)
    citepattern = r'[\s(][A-Z1][a-zA-Z1]*-?[a-zA-Z1]*[ ,]? \(?\d\d\d\d[a-z]?[\s.,)]'
    rawcitelist = re.findall(citepattern, citefile)
    cleancitelist = cleanup(rawcitelist)  # cleanup() is defined elsewhere (not shown)
    finalcitelist = list(set(cleancitelist))  # drop duplicates
    print(finalcitelist)
    return finalcitelist

The other chunk of text, which the code above matches correctly: https://gist.github.com/paultopia/a12eba2752638389b2ee

The only hypothesis I can come up with is that the first, broken, regex is puking on the combination of newline characters and the string being treated as a byte object, even though a) I know the regex handles newlines correctly (because of the confirmation from the linked regex101), and b) I know re can match strings in this form (because of the successful match on the other string).

If that's true, though, I don't know what to do about it.
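One way to test that hypothesis is to run the same pattern over a real string and over the result of calling str() on its encoded bytes. This is only a sketch: the two-entry `sample` text is made up (it is not the gist contents), and the pattern assumes the character classes are `[A-Z1]` and `[a-zA-Z1]`.

```python
import re

# Hypothetical two-entry reference list; NOT the contents of the gist.
sample = ("Rawls, John. 1971. A Theory of Justice.\n"
          "Nozick, Robert. 1974. Anarchy, State, and Utopia.")

name_pattern = r'(^[A-Z1][a-zA-Z1]*-?[a-zA-Z1]*),.*( \(?\d\d\d\d[a-z]?[.)])'

# Real str with real newlines: with re.MULTILINE, ^ matches at the start of every line.
good = re.findall(name_pattern, sample, re.MULTILINE)

# Mimic the conversion in the question: encode to bytes, then call str() on the bytes.
mangled = str(sample.encode('ascii'))
bad = re.findall(name_pattern, mangled, re.MULTILINE)

print(good)               # [('Rawls', ' 1971.'), ('Nozick', ' 1974.')]
print(repr(mangled[:8]))  # the text now starts with the characters b and '
print(bad)                # []
```

In the mangled version there are no real newline characters left (each one has become a backslash followed by the letter n), so ^ can only match at position 0, where the text begins with a lowercase b. A pattern without ^, like the citation pattern or r'rawls', is unaffected, which would explain why the other search still works.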

Thus, questions:

1) Is my hypothesis right that it's the combination of newlines and the b that blows up my regex? If not, what is?
2) How do I fix that?
a) Replace the newlines with something else in the string?
b) Rewrite the regex somehow?
c) Somehow get rid of the b and make it a normal string again? (How?)

Thanks!

Addition

In case this is a problem I need to fix upstream, here's the code I'm using to get the text files and convert them to ASCII, replacing non-ASCII characters:

This function gets called on UTF-8 .txt files saved by TextWrangler in Mavericks:

def makecorpoi(citefile, reffile):
    citebox = open(citefile, 'r')
    refbox = open(reffile, 'r')
    citecorpus = citebox.read()
    refcorpus = refbox.read()
    citebox.close()
    refbox.close()
    corpoi = [str(citecorpus), str(refcorpus)]
    return corpoi
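As an aside, the reading step could be written with an explicit encoding and a context manager, so the result does not depend on the platform's default encoding and the file is closed even if read() fails. This is a minimal sketch: `read_text` is a hypothetical helper and the temporary file just stands in for a TextWrangler export.

```python
import os
import tempfile

def read_text(path):
    # Hypothetical helper: decode explicitly as UTF-8; the with-block closes the file.
    with open(path, 'r', encoding='utf-8') as handle:
        return handle.read()

# Throwaway file standing in for one of the UTF-8 .txt files.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'refs.txt')
    with open(path, 'w', encoding='utf-8') as handle:
        handle.write('Müller, A. 2001.\nRawls, J. 1971.\n')
    text = read_text(path)

print(type(text).__name__)  # str
print(repr(text))
```

Since read() in text mode already returns a str, wrapping the result in str() afterwards is a no-op.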

And this function gets called on each element of the list that the above function returns:

import codecs

def conv2ascii(bigstring):
    def convhandler(error):
        # Replace the offending character with a marker and resume after it
        return ('1foreign', error.start + 1)
    codecs.register_error('foreign', convhandler)
    bigstring = bigstring.encode('ascii', 'foreign')
    stringstring = str(bigstring)
    return stringstring

Aah. I've tracked it down and answered my own question. Apparently one needs to call some kind of decode method on the encoded thing. The following code produces an actual string, with newlines and everything, out the other end (though I have to fix a bunch of other bugs before I can figure out if the final output is as expected):

def conv2ascii(bigstring):
    def convhandler(error):
        return ('1foreign', error.start + 1)
    codecs.register_error('foreign', convhandler)
    bigstring = bigstring.encode('ascii', 'foreign')    # str -> bytes
    newstring = bigstring.decode('ascii', 'foreign')    # bytes -> str
    return newstring
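For comparison, the same replacement can probably be done without a round trip through bytes, by substituting every non-ASCII character directly. This swaps in a different technique (re.sub instead of a codecs error handler), and `to_ascii` and its `marker` parameter are hypothetical names, so treat it as a sketch rather than a drop-in replacement.

```python
import re

def to_ascii(text, marker='1foreign'):
    # Hypothetical helper: swap each non-ASCII character for the marker.
    # The lambda keeps the marker literal even if it contains backslashes.
    return re.sub(r'[^\x00-\x7f]', lambda match: marker, text)

converted = to_ascii('café\nnaïve')
print(repr(converted))  # 'caf1foreign\nna1foreignve'
```

The result stays a str the whole time, so newlines are preserved and no b prefix can appear.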

Apparently the str() function doesn't do the same job, for reasons that are mysterious to me. This is despite the fact that an answer to "How to make new line commands work in a .txt file opened from the internet?" suggests that it does.
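The difference seems to come down to how str() treats a bytes object in Python 3: with a single argument it returns the printable representation (the b prefix, the quotes, and escaped newlines all become literal characters), whereas passing an encoding, or calling .decode(), actually decodes the bytes. A short illustration with a made-up two-line value:

```python
raw = 'line one\nline two'.encode('ascii')  # a bytes object

as_repr = str(raw)                # representation of the bytes object, not its text
as_text = raw.decode('ascii')     # real decoding
also_text = str(raw, 'ascii')     # str() with an encoding argument also decodes

print(as_repr)             # b'line one\nline two'  (a single printed line)
print(as_text)             # prints two lines
print('\n' in as_repr)     # False
print('\n' in as_text)     # True
```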

